Abstract

The classical problem of reliable point-to-point digital communication is to achieve a low probability of error 
while keeping the rate high and the total power consumption small. Traditional information-theoretic analysis 
uses explicit models for the communication channel to study the power spent in transmission. The resulting bounds 
are expressed using 'waterfall' curves that convey the revolutionary idea that unboundedly low probabilities of
bit-error are attainable using only finite transmit power. However, practitioners have long observed that the decoder
complexity, and hence the total power consumption, goes up when attempting to use sophisticated codes that operate 
close to the waterfall curve. 

This paper gives an explicit model for power consumption at an idealized decoder that allows for extreme 
parallelism in implementation. The decoder architecture is in the spirit of message passing and iterative decoding 
for sparse-graph codes, but is further idealized in that it allows for more computational power than is currently known 
to be implementable. Generalized sphere-packing arguments are used to derive lower bounds on the decoding power 
needed for any possible code given only the gap from the Shannon limit and the desired probability of error. As 
the gap goes to zero, the energy per bit spent in decoding is shown to go to infinity. This suggests that to optimize 
total power, the transmitter should operate at a power that is strictly above the minimum demanded by the Shannon 
capacity. 

The lower bound is plotted to show an unavoidable tradeoff between the average bit-error probability and the total 
power used in transmission and decoding. In the spirit of conventional waterfall curves, we call these 'waterslide' 
curves. The bound is shown to be order optimal by showing the existence of codes that can achieve similarly shaped 
waterslide curves under the proposed idealized model of decoding. 
The price of certainty: "waterslide curves" and the gap to capacity

Note: A preliminary version of this work with weaker bounds was submitted to ITW 2008 in Porto [1].

I. Introduction 

As digital circuit technology advances and we pass into the era of billion-transistor chips, it is clear that the 
fundamental limit on practical codes is not any nebulous sense of "complexity" but the concrete issue of power 
consumption. At the same time, the proposed applications for error-correcting codes continue to shrink in the 
distances involved. Whereas earlier "deep space communication" helped stimulate the development of information 
and coding theory [2], [3], there is now an increasing interest in communication over much shorter distances ranging 
from a few meters [4] to even a few millimeters in the case of inter-chip and on-chip communication [5]. 

The implications of power-consumption beyond transmit power have begun to be studied by the community. The 
common thread in [6]-[10] is that the power consumed in processing the signals can be a substantial fraction of the 
total power. In [11], it is observed that within communication networks, it is worth developing cross-layer schemes 
to reduce the time that devices spend being active. In [9], an information-theoretic formulation is considered. 
When the transmitter is in the 'on' state, its circuit is modeled as consuming some fixed power in addition to 
the power radiated in the transmission itself. Therefore, it makes sense to shorten the overall duration of a packet 
transmission and to satisfy an average transmit-power constraint by bursty signalling that does not use all available 
degrees of freedom. In [7], the authors take into account a peak-power constraint as well, as they study the optimal 
constellation size for uncoded transmission. A large constellation requires a smaller 'on' time, and hence less 
circuit power. However, a larger constellation requires higher power to maintain the same spacing of constellation 
points. An optimal constellation has to balance between the two, but overall this argues for the use of higher rates. 
However, none of these really tackle the role of the decoding complexity itself. 

In [12], the authors take a more receiver-centric view and focus on how to limit the power spent in sampling the 
signal at the receiver. They point out that empirically for ultrawideband systems aiming for moderate probabilities 
of error, this sampling cost can be larger than the decoding cost! They introduce the ingenious idea of adaptively 
puncturing the code at the receiver rather than at the transmitter. They implicitly argue for the use of longer codes 
whose rates are further from the Shannon capacity so that the decoder has the flexibility to adaptively puncture as 
needed and thereby save on total power consumption. 

In [4], the authors study the impact of decoding complexity using the metric of coding gain. They take an 
empirical point of view using power-consumption numbers for certain decoder implementations at moderately low 
probabilities of error. They observe that it is often better to use no coding at all if the communication range is low 
enough. 

In this paper, we take an asymptotic approach to see if considering decoding power has any fundamental 
implications as the average probability of bit error tends to zero. In Section II, we give an asymptotic formulation
of what it should mean to approach capacity when we must consider the power spent in decoding in addition to that 
spent in transmission. We next consider whether classical approaches to encoding/decoding such as dense linear 
block codes and convolutional codes can satisfy our stricter standard of approaching capacity and argue that they 
cannot. Section III then focuses our attention on iterative decoding by message passing and defines the system
model for the rest of the paper. 

Section IV derives general lower bounds to the complexity of iterative decoders for BSC and AWGN channels in
terms of the number of iterations required to achieve a desired probability of error at a given transmit power. These
bounds can be considered iterative-decoding counterparts to the classical sphere-packing bounds (see e.g. [13],
[14]) and are derived by generalizing the delay-oriented arguments of [15], [16] to the decoding neighborhoods in 
iterative decoding. These bounds are then used to show that it is in principle possible for iterative decoders to be a 
part of a weakly capacity-achieving communication system. However, the power spent by our model of an iterative 
decoder must go to infinity as the probability of error tends to zero and so this style of decoding rules out a strong 
sense of capacity-achieving communication systems. 
We discuss related work in the sparse-graph-code context in Section V and make precise the notion of gap to
capacity before evaluating our lower bounds on the number of iterations as the gap to capacity closes. We conclude
in Section VI with some speculation and point out some interesting questions for future investigation.

II. Certainty-achieving codes 

Consider a classical point-to-point AWGN channel with no fading. For uncoded transmission with BPSK signaling,
the probability of bit-error is an exponentially decreasing function of the transmitted energy per symbol.
To approach certainty (make the probability of bit-error very small), the transmitted energy per symbol must go to 
infinity. If the symbols each carry a small number of bits, then this implies that the transmit power is also going 
to infinity since the number of symbols per second is a nonzero constant determined by the desired rate of R bits 
per second. 
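
As a rough numerical illustration of this point, the Python sketch below evaluates the standard hard-decision BPSK error probability $Q(\sqrt{\mathrm{SNR}})$, where SNR is taken to be the received energy per symbol divided by the noise variance. The function names, the SNR normalization, and the sample SNR values are illustrative assumptions rather than part of the original analysis.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def uncoded_bpsk_bit_error(snr: float) -> float:
    """Bit-error probability of uncoded BPSK with hard decisions.

    `snr` is the received energy per symbol divided by the noise variance
    (linear scale, not dB). This normalization is an assumption.
    """
    return q_function(math.sqrt(snr))


if __name__ == "__main__":
    # Hypothetical sample points: the error probability keeps falling
    # but is strictly positive at every finite SNR.
    for snr_db in (0, 5, 10, 15):
        snr = 10 ** (snr_db / 10)
        print(f"{snr_db:>2} dB -> P_e = {uncoded_bpsk_bit_error(snr):.3e}")
```

The error probability falls quickly with SNR but never reaches zero at a finite SNR, which is why uncoded transmission has to pay for certainty with unbounded energy per symbol.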

Shannon's genius in [17] was to recognize that while there was no way to avoid having the transmitted energy 
go to infinity and still approach certainty, this energy could be amortized over many bits of information. This meant 
that the transmitted power could be kept finite and certainty could be approached by paying for it using end-to-end 
delay (see [16] for a review) and whatever implementation complexity is required for the encoding and decoding. 
For a given channel and transmit power $P_T$, there is a maximum rate $C(P_T)$ that can be supported. Turned around,
this classical result is traditionally expressed by fixing the desired rate $R$ and looking at the required transmit
power. The resulting "waterfall curves" are shown¹ in Figure 1. These sharp curves are distinguished from the
more gradual "waterslide curves" of uncoded transmission.
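
A minimal sketch of how one point on such a waterfall curve could be computed, assuming the average-power-constrained AWGN capacity $C = \frac{1}{2}\log_2(1 + \mathrm{SNR})$ and the rate-distortion adjustment $1 - h_b(\langle P_e \rangle)$ for a tolerated average bit-error probability. The helper names are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h_b(p) in bits, with h_b(0) = h_b(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def min_snr_awgn(rate: float, bit_error: float) -> float:
    """Smallest SNR satisfying C(SNR) >= R * (1 - h_b(P_e)).

    Assumes C(SNR) = 0.5 * log2(1 + SNR) bits per channel use and
    0 <= bit_error <= 0.5. Returns a linear-scale SNR.
    """
    effective_rate = rate * (1.0 - binary_entropy(bit_error))
    return 2.0 ** (2.0 * effective_rate) - 1.0


if __name__ == "__main__":
    # Rate 1/3, as in Fig. 1; the bit-error targets are sample values.
    for pe in (1e-1, 1e-3, 1e-6, 1e-12):
        snr = min_snr_awgn(1.0 / 3.0, pe)
        print(f"P_e = {pe:.0e}: SNR >= {10 * math.log10(snr):.3f} dB")
```

As the target error probability shrinks, the required SNR saturates at the finite value $2^{2R} - 1$, which is the sharp vertical edge of the waterfall.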




Fig. 1. The Shannon waterfalls: plots of $\log(\langle P_e \rangle)$ vs required SNR (in dB) for a fixed rate-1/3 code transmitted using BPSK over an
AWGN channel with hard decisions at the detector. A comparison is made with the rate-1/3 repetition code: uncoded transmission with the 
same bit repeated three times. Also shown is the waterfall curve for the average power constrained AWGN channel. 



Traditionally, a family of codes was considered capacity achieving if it could support arbitrarily low probabilities
of error at transmit powers arbitrarily close to that predicted by capacity. The complexity of the encoding and
decoding steps was considered to be a separate and qualitatively distinct performance metric. This makes sense
when the communication is long-range, since the "exchange rate" between transmitter power and the power that
ends up being delivered to the receiver is very poor due to distance-induced attenuation.

1 Since the focus of this paper is on average bit error probability, these curves combine the results of [17], [18] and adjust the required
capacity by a factor of the relevant rate-distortion function $1 - h_b(\langle P_e \rangle)$.

In light of the advances in digital circuits and the need for shorter-range communication, we propose a new way 
of formalizing what it means for a coding approach to be "capacity achieving" using the single natural metric:
power. 

A. Definitions 

Assume the traditional information-theoretic model (see e.g. [13], [19]) of fixed-rate discrete-time communication
with $k$ total information bits, $m$ channel uses, and the rate of $R = k/m$ bits per channel use. As is traditional, the
rate $R$ is held constant while $k$ and $m$ are allowed to become asymptotically large. $\langle P_{e,i} \rangle$ is the average probability
of bit error on the $i$-th message bit and $\langle P_e \rangle = \frac{1}{k} \sum_{i=1}^{k} \langle P_{e,i} \rangle$ is used to denote the overall average probability of bit
error. No restrictions are assumed on the codebooks aside from those required by the channel model. The channel 
model is assumed to be indexed by the power used in transmission. The encoder and decoder are assumed to be 
physical entities that consume power according to some model that can be different for different codes. 

Let $\xi_T P_T$ be the actual power used in transmission and let $P_C$ and $P_D$ be the power consumed in the operation
of the encoder and decoder respectively. $\xi_T$ is the exchange rate (total path-loss) that connects the power spent at
the transmitter to the received power $P_T$ that shows up at the receiver. In the spirit of [10], we assume that the
goal of the system designer is to minimize some weighted combination $P_{total} = \xi_T P_T + \xi_C P_C + \xi_D P_D$ where the
vector $\vec{\xi} > 0$. The weights can be different depending on the applications and $\xi_T$ is tied to the distance between
the transmitter and receiver as well as the propagation environment.

For any rate $R$ and average probability of bit error $\langle P_e \rangle > 0$, we assume that the system designer will minimize the
weighted combination above to get optimized $P_{total}(\vec{\xi}, \langle P_e \rangle, R)$ as well as constituent $P_T(\vec{\xi}, \langle P_e \rangle, R)$, $P_C(\vec{\xi}, \langle P_e \rangle, R)$,
and $P_D(\vec{\xi}, \langle P_e \rangle, R)$.

Definition 1: The certainty of a particular encoding and decoding system is the reciprocal of the average 
probability of bit error. 

Definition 2: An encoding and decoding system at rate $R$ bits per second is weakly certainty achieving if
$\liminf_{\langle P_e \rangle \to 0} P_T(\vec{\xi}, \langle P_e \rangle, R) < \infty$ for all weights $\vec{\xi} > 0$.

If an encoder/decoder system is not weakly certainty achieving, then this means that it does not deliver on the 
revolutionary promise of the Shannon waterfall curve from the perspective of transmit power. Instead, such codes 
encourage system designers to pay for certainty using unbounded transmission power. 

Definition 3: An encoding and decoding system at rate $R$ bits per second is strongly certainty achieving if
$\liminf_{\langle P_e \rangle \to 0} P_{total}(\vec{\xi}, \langle P_e \rangle, R) < \infty$ for all weights $\vec{\xi} > 0$.

A strongly certainty-achieving system would deliver on the full spirit of Shannon's vision: that certainty can 
be approached at finite total power just by accepting longer end-to-end delays and amortizing the total energy 
expenditure over many bits. The general distinction between strong and weak certainty-achieving systems relates to
how the decoding power $P_D(\vec{\xi}, \langle P_e \rangle, R)$ varies with the probability of bit-error $\langle P_e \rangle$ for a fixed rate $R$. Does
it have waterfall or waterslide behavior? For example, it is clear that uncoded transmission has very simple
encoding/decoding³ and so $P_D(\vec{\xi}, \langle P_e \rangle, R)$ has a waterfall behavior.

Definition 4: A {weakly | strongly} certainty-achieving system at rate $R$ bits per second is also {weakly | strongly}
capacity achieving if

$$\liminf_{(\xi_C, \xi_D) \to 0} \; \liminf_{\langle P_e \rangle \to 0} P_T(\vec{\xi}, \langle P_e \rangle, R) = C^{-1}(R) \tag{1}$$

where $C^{-1}(R)$ is the minimum transmission power that is predicted by the Shannon capacity of the channel model.

2 For example, in an RFID application, the power used by the tag is actually supplied wirelessly by the reader. If the tag is the decoder,
then it is natural to make $\xi_D$ even larger than $\xi_T$ in order to account for the inefficiency of the power transfer from the reader to the
tag. One-to-many transmission of multicast data is another example of an application that can increase $\xi_D$. The $\xi_D$ in that case should be
increased in proportion to the number of receivers that are listening to the message.

3 All that is required is the minimum power needed to sample the received signal and threshold the result.

This sense of capacity achieving makes explicit the sense in which we should consider encoding and decoding
to be asymptotically free, but not actually free. The traditional approach of modeling encoding and decoding as
being actually free can be recovered by swapping the order of the limits in (1).

Definition 5: An encoding and decoding system is considered traditionally capacity achieving if

$$\liminf_{\langle P_e \rangle \to 0} \; \liminf_{(\xi_C, \xi_D) \to 0} P_T(\vec{\xi}, \langle P_e \rangle, R) = C^{-1}(R) \tag{2}$$

where $C^{-1}(R)$ is the minimum transmission power that is predicted by the Shannon capacity of the channel model.

By taking the limit $(\xi_C, \xi_D) \to 0$ for a fixed probability of error, this traditional approach makes it impossible
to capture any fundamental tradeoff with complexity in an asymptotic sense.

The conceptual distinction between the new (1) and old (2) senses of capacity-achieving systems parallels
Shannon's distinction between zero-error capacity and regular capacity [20]. If $C(\epsilon, d)$ is the maximum rate that
can be supported over a channel using end-to-end delay $d$ and average probability of error $\epsilon$, then traditional
capacity $C = \lim_{\epsilon \to 0} \lim_{d \to \infty} C(\epsilon, d)$ while zero-error capacity $C_0 = \lim_{d \to \infty} \lim_{\epsilon \to 0} C(\epsilon, d)$. When the limits
are taken together in some balanced way, then we get concepts like anytime capacity [16], [21]. It is known that
$C_0 < C_{any} < C$ in general and so it is natural to wonder whether any codes are capacity achieving in the new
stricter sense of Definition 4.



B. Are classical codes capacity achieving? 

1) Dense linear block codes with nearest-neighbor decoding: Dense linear fixed-block-length codes are traditionally
capacity achieving under ML decoding [13]. To understand whether they are weakly certainty achieving, we
need a model for the encoding and decoding power. Let $m$ be the block length of the code. Each codeword symbol
requires $mR$ operations to encode and it is reasonable to assume that each operation consumes some energy. Thus,
the encoding power is $O(m)$. Meanwhile, a straightforward implementation of ML (nearest-neighbor) decoding
has complexity exponential in the block-length and thus it is reasonable to assume that it consumes an exponential
amount of power as well.

The probability of error for ML decoding drops exponentially with $m$ with an exponent that is bounded above by
the sphere-packing exponent $E_{sp}(R)$ [13]. An exponential reduction in the probability of error is thus paid for using
an exponential increase in decoding power. Consequently, it is easy to see that the certainty return on investments in
decoding power is only polynomial. Meanwhile, the certainty return on investments in transmit power is exponential
even for uncoded transmission. So no matter what the values are for $\xi_D > 0$, in the high-certainty limit of very low
probabilities of error, an optimized communication system built using dense linear block codes will be investing
ever increasing amounts in transmit power.
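
The polynomial return can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that the error probability is exactly $e^{-m E_{sp}(R)}$ (a natural-log exponent) and that decoding power scales as $2^{mR}$, ignoring factors polynomial in $m$; both are simplifying assumptions made only for illustration.

```latex
\begin{align*}
\langle P_e \rangle &\approx e^{-m E_{sp}(R)}, \qquad P_D \approx 2^{mR}
  \;\Longrightarrow\; m \approx \frac{\log_2 P_D}{R}, \\
\langle P_e \rangle &\approx \exp\!\left(-\frac{E_{sp}(R)}{R}\,\log_2 P_D\right)
  = P_D^{-E_{sp}(R)/(R \ln 2)} .
\end{align*}
```

Under these assumptions the certainty $1/\langle P_e \rangle$ grows only as a fixed power of $P_D$, while for uncoded transmission it grows exponentially in the transmit power.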

A plot of the resulting waterslide curves for both transmit power and decoding power is given in Figure 2.
Following tradition, the horizontal axes in the plots are given in normalized SNR units for power. Notice how the 
optimizing system invests heavily in additional transmit power to approach low probabilities of error. 

2) Convolutional codes under Viterbi decoding: For convolutional codes, there are two decoding algorithms, and
hence two different analyses (see [22], [23] for details). For Viterbi decoding, the complexity per-bit is exponential
in the constraint length $RL_c$ bits. The error exponents with the constraint length of $L_c$ channel uses are upper-bounded
in [24], and this bound is given parametrically by

$$E_{conv}(R, P_T) = E_0(\rho, P_T); \qquad R = \frac{E_0(\rho, P_T)}{\rho} \tag{3}$$

where $E_0$ is the Gallager function [13] and $\rho > 0$. The important thing here is that just as in dense linear block
codes, the certainty return on investments in decoding power is only polynomial, albeit with a better polynomial
than linear block-codes since $E_{conv}(R, P_T)$ is higher than the sphere-packing bound for block codes [13]. Thus,
an optimized communication system built using Viterbi decoding will also be investing ever increasing amounts in 
transmit power. Viterbi decoding is not weakly certainty achieving. 

A plot of the resulting waterslide curves for both transmit power and decoding power is given in Figure 3.
Notice that the performance in Figure 3 is better than that of Figure 2. This reflects the superior error exponents
of convolutional codes with respect to their computational parameter, the constraint length.

Fig. 2. The waterslide curves for transmit power, decoding power, and the total power for dense linear block-codes of rate $R = 1/3$
under brute-force ML decoding. It is assumed that the normalized energy required per operation at the decoder is $E = 0.3$ and that it takes
$2^{mR} \times mR$ operations per channel output to decode using nearest-neighbor search for a block length of $m$ channel uses.



3) Convolutional codes under magical sequential decoding: For convolutional codes with sequential decoding,
it is shown in [25] that the average number of guesses must increase to infinity if the message rate exceeds the
cut-off rate, $E_0(1)$. However, below the cut-off rate, the average number of guesses is finite. Each guess at the
decoder costs $L_c R$ multiply-accumulates and we assume that this means that average decoding power also scales
as $O(L_c)$ since at least one guess is made for each received sample.

For simplicity, let us ignore the issue of the cut-off rate and further assume that the decoder magically makes 
just one guess and always gets the ML answer. The convolutional coding error exponent (3) still applies, and so
the system's certainty gets an exponential return for investments in decoding power. It is now no longer obvious 
how the optimized-system will behave in terms of transmit power. 

For the magical system, the encoder power and decoder power are both linear in the constraint-length. Group
them together with the path-loss and normalize units to get a single effective term $\gamma L_c$. The goal now is to minimize

$$P_T + \gamma L_c \tag{4}$$

over $P_T$ and $L_c$ subject to the probability of error constraint that $\ln \frac{1}{\langle P_e \rangle} = E_{conv}(R, P_T) L_c$. Since we are interested
in the limit of $\ln \frac{1}{\langle P_e \rangle} \to \infty$, it is useful to turn this around and use Lagrange multipliers. A little calculation reveals
that the optimizing values of $P_T$ and $L_c$ must satisfy the balance condition

$$E_{conv}(R, P_T) = \gamma L_c \frac{\partial E_{conv}(R, P_T)}{\partial P_T} \tag{5}$$

and so (neglecting integer-effects) the optimizing constraint-length is either 1 (uncoded transmission) or

$$L_c = \frac{1}{\gamma} E_{conv}(R, P_T) \Big/ \frac{\partial E_{conv}(R, P_T)}{\partial P_T}. \tag{6}$$
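
One way to fill in the Lagrange-multiplier calculation is sketched below. It assumes the error constraint is written as $E_{conv}(R, P_T) L_c = K$, where $K$ is a hypothetical symbol for the target value of $\ln(1/\langle P_e \rangle)$, $\gamma$ is the effective cost per unit of constraint length, and $\lambda$ is the multiplier. A constant factor in the constraint would not change the result.

```latex
\begin{align*}
\mathcal{L}(P_T, L_c, \lambda) &= P_T + \gamma L_c
   - \lambda\bigl(E_{conv}(R,P_T)\,L_c - K\bigr), \\
\frac{\partial \mathcal{L}}{\partial P_T} = 0
   &\;\Longrightarrow\; 1 = \lambda\, L_c\, \frac{\partial E_{conv}(R,P_T)}{\partial P_T}, \\
\frac{\partial \mathcal{L}}{\partial L_c} = 0
   &\;\Longrightarrow\; \gamma = \lambda\, E_{conv}(R,P_T), \\
\text{eliminating } \lambda = \frac{\gamma}{E_{conv}(R,P_T)}
   &\;\Longrightarrow\; E_{conv}(R,P_T) = \gamma\, L_c\, \frac{\partial E_{conv}(R,P_T)}{\partial P_T}.
\end{align*}
```

Solving the last line for $L_c$ gives the interior optimum for the constraint length; the boundary case $L_c = 1$ corresponds to uncoded transmission.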

To get ever lower values of $\langle P_e \rangle$, the transmit power $P_T$ must therefore increase unboundedly unless the ratio
$E_{conv}(R, P_T) / \frac{\partial E_{conv}(R, P_T)}{\partial P_T}$ approaches infinity for some finite $P_T$. Since the convolutional coding error exponent
(3) does not go to infinity at a finite power, this requires $\frac{\partial E_{conv}(R, P_T)}{\partial P_T}$ to approach zero. For AWGN style channels,
this only occurs⁴ as $P_T$ approaches infinity and thus the gap between $R$ and the capacity gets large.

Fig. 3. The waterslide curves for transmit power, decoding power, and the total power for convolutional codes of rate $R = 1/3$ used with
Viterbi decoding. It is assumed that the normalized energy required per operation at the decoder is $E = 0.3$ and that it takes $2^{L_c R} \times L_c R$
operations per channel output to decode using Viterbi search for a constraint length of $L_c$ channel uses.

The resulting plots for the waterslide curves for both transmit power and encoding/decoding power are given in
Figure 4. Although these plots are much better than those in Figure 3, the surprise is that even such a magical
system that attains an error-exponent with investments in decoding power is unable to be weakly certainty achieving 
at any rate. Instead, the optimizing transmit power goes to infinity. 

4) Dense linear block codes with magical syndrome decoding: It is well known that linear codes can be decoded
by looking at the syndrome of the received codeword [13]. Suppose that we had a magical syndrome decoder that 
could use a free lookup table to translate the syndrome into the ML corrections to apply to the received codeword. 
The complexity of the decoding would just be the complexity of computing the syndrome. For a dense random 
linear block code, the parity-check matrix is itself typically dense and so the per-channel-output complexity of 
computing each bit of the syndrome is linear in the block-length. This gives rise to behavior like that of magical 
sequential decoding above and is illustrated in Figure 5.

From the above discussion, it seems that in order to have even a weakly certainty-achieving system, the
certainty-return for investments in encoding/decoding power must be faster than exponential!

III. Parallel iterative decoding: a new hope 

The unrealistic magical syndrome decoder suggests a way forward. If the parity-check matrix were sparse, then
it would be possible to compute the syndrome using a constant number of operations per received symbol. If the
probability of error dropped with block-length, that would give rise to an infinite return on investments in decoder
power. This suggests looking in the direction of LDPC codes [26]. While magical syndrome decoding is unrealistic,
many have observed that message-passing decoding gives good results for such codes while being implementable
[27].

4 There is a slightly subtle issue here. Consider random codes for a moment. The convolutional random-coding error exponent is flat at
$E_0(1, P_T)$ for rates $R$ below the computational cutoff rate. However, that flatness with rate $R$ is not relevant here. For any fixed constellation,
the $E_0(1, P_T)$ is a strictly monotonically increasing function of $P_T$, even though it asymptotes at a non-infinite value. This is not enough
since the derivative with transmit power still tends to zero only as $P_T$ goes to infinity.

Fig. 4. The waterslide curves for transmit power, decoding power, and the total power for convolutional codes of rate $R = 1/3$ used
with "magical" sequential decoding. It is assumed that the normalized energy required per operation at the decoder is $E = 0.3$ and that the
decoding requires just $L_c R$ operations per channel output.

Upon reflection, it is clear that parallel iterative decoding based on message passing holds out the potential for 
super-exponential improvements in probability of error with decoding power. This is because messages can reach 
an exponential-sized neighborhood in only a small number of iterations, and large-deviations thinking suggests 
that there is the possibility for an exponential reduction in the probability of error with neighborhood size. In fact, 
exactly this sort of double-exponential reduction in the probability of error under iterative decoding has been shown 
to be possible for regular LDPCs [28, Theorem 5]. 

To make all this precise, we need to fix our model of the problem and of an implementable decoder. Consider
a point-to-point communication link. An information sequence is encoded into one of $2^{mR}$ codewords $X_1^m$,
using a possibly randomized encoder. The observed channel output is $Y_1^m$. The information sequences are assumed
to consist of iid fair coin tosses and hence the rate of the code is $R = k/m$. Following tradition, both $k$ and $m$
are considered to be very large. We ignore the complexity of doing the encoding under the hope that encoding is
simpler than decoding.⁵

Two channel models are considered: the BSC and the power-constrained AWGN channel. The true channel is
always denoted $P$. The underlying AWGN channel has noise variance $\sigma_P^2$ and the average received power is denoted
$P_T$ so the received SNR is $P_T / \sigma_P^2$. Similarly, we assume that the BSC has crossover probability $p$. We consider the
BSC to have resulted from BPSK modulation followed by hard-decision detection on the AWGN channel and so
$p = Q\left(\sqrt{P_T / \sigma_P^2}\right)$.

For maximum generality, we do not impose any a priori structure on the code itself. Instead, inspired by [30]-[33],
we focus on the parallelism of the decoder and the energy consumed within it. We assume that the decoder
is physically made of computational nodes that pass messages to each other in parallel along physical (and hence
unchanging) wires. A subset of nodes are designated 'message nodes' in that each is responsible for decoding
the value of a particular message bit. Another subset of nodes (not necessarily disjoint) has members that are
each initialized with at most one observation of the received channel-output symbols. There may be additional
computational nodes that are just there to help decode.

5 For certain LDPC-codes, it is shown in [29] that encoding can be made to have complexity linear in the block-length for a certain model
of encoding. In our context, linear complexity means that the complexity per data bit is constant and thus this does not require power at the
encoder that grows with either the block length or the number of decoder iterations. We have not yet verified if the complexity of encoding
is linear under our computational model.

Fig. 5. The waterslide curves for transmit power, decoding power, and the total power for dense linear block-codes of rate $R = 1/3$ under
magical syndrome decoding. It is assumed that the normalized energy required per operation at the decoder is $E = 0.3$ and that the decoding
requires just $(1 - R)mR$ operations per channel output to compute the syndrome.

The implementation technology is assumed to dictate that each computational node is connected to at most
$\alpha + 1 > 2$ other nodes⁶ with bidirectional wires. No other restriction is assumed on the topology of the decoder. In
each iteration, each node sends (possibly different) messages to all its neighboring nodes. No restriction is placed
on the size or content of these messages except for the fact that they must depend on the information that
has reached the computational node in previous iterations. If a node wants to communicate with a more distant
node, it has to have its message relayed through other nodes. No assumptions are made regarding the presence
or absence of cycles in this graph. The neighborhood size at the end of $l$ iterations is denoted by $n \le \alpha^{l+1}$. We
assume $m \gg n$. Each computational node is assumed to consume a fixed $E_{node}$ joules of energy at each iteration.

Let the average probability of bit error of a code be denoted by $\langle P_e \rangle_P$ when it is used over channel $P$. The goal
is to derive a lower bound on the neighborhood size $n$ as a function of $\langle P_e \rangle_P$ and $R$. This then translates to a
lower bound on the number of iterations which can in turn be used to lower bound the required decoding power.

Throughout this paper, we allow the encoding and decoding to be randomized with all computational nodes 
allowed to share a common pool of common randomness. We use the term 'average probability of error' to refer 
to the probability of bit error averaged over the channel realizations, the messages, the encoding, and the decoding. 

6 In practice, this limit could come from the number of metal layers on a chip. $\alpha = 1$ would just correspond to a big ring of nodes and is
uninteresting for that reason.

IV. Lower bounds on decoding complexity: iterations and power

In this section, lower bounds are stated on the computational complexity for iterative decoding as a function of the 
gap from capacity. These bounds reveal that the decoding neighborhoods must grow unboundedly as the system tries 
to approach capacity. We assume the decoding algorithm is implemented using the iterative technology described in
Section III. The resulting bounds are then optimized numerically to give plots of the optimizing transmission and
decoding powers as the average probability of bit error goes to zero. For transmit power, it is possible to evaluate
the limiting value as the system approaches certainty. However, decoding power is shown to diverge to infinity for
the same limit. This shows that the lower bound does not rule out weakly capacity-achieving schemes, but strongly
capacity-achieving schemes are impossible using Section III's model of iterative decoding.

A. Lower bounds on the probability of error in terms of decoding neighborhoods 

The main bounds are given by theorems that capture a local sphere-packing effect. These can be turned around
to give a family of lower bounds on the neighborhood size $n$ as a function of $\langle P_e \rangle_P$. This family is indexed by the
choice of a hypothetical channel $G$ and the bounds can be optimized numerically for any desired set of parameters.

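Theorem 1: Consider a BSC with crossover probability $p < \frac{1}{2}$. Let $n$ be the maximum size of the decoding
neighborhood of any individual bit. The following lower bound holds on the average probability of bit error.

$$\langle P_e \rangle_P \ge \sup_{C^{-1}(R) < g \le \frac{1}{2}} \frac{h_b^{-1}(\delta(G))}{2} \, 2^{-n D(g \| p)} \left( \frac{p(1-g)}{g(1-p)} \right)^{\epsilon \sqrt{n}} \tag{7}$$

where $h_b(\cdot)$ is the usual binary entropy function, $D(g \| p) = g \log_2 \frac{g}{p} + (1-g) \log_2 \frac{1-g}{1-p}$ is the usual KL-divergence, and

$$\delta(G) = 1 - \frac{C(G)}{R} \tag{8}$$

where $C(G) = 1 - h_b(g)$,

$$\epsilon = \sqrt{\frac{1}{K(g)} \log_2 \left( \frac{2}{h_b^{-1}(\delta(G))} \right)} \tag{9}$$

and

$$K(g) = \inf_{0 < \eta < 1-g} \frac{D(g + \eta \| g)}{\eta^2}. \tag{10}$$

Proof: See Appendix I. ■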

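Theorem 2: For the AWGN channel and the decoder model in Section III, let $n$ be the maximum size of the 
decoding neighborhood of any individual message bit. The following lower bound holds on the average probability 
of bit error: 

$$\langle P_e \rangle_P \geq \sup_{\sigma_G^2:\, C(G) < R} \frac{h_b^{-1}(\delta(G))}{2} \exp\left( -n D(\sigma_G^2\|\sigma_P^2) - \sqrt{n} \left( \frac{3}{2} + 2 \ln\left( \frac{2}{h_b^{-1}(\delta(G))} \right) \right) \left( \frac{\sigma_G^2}{\sigma_P^2} - 1 \right) \right) \tag{11}$$

where $\delta(G) = 1 - C(G)/R$, the capacity $C(G) = \frac{1}{2} \log_2\left(1 + \frac{P_T}{\sigma_G^2}\right)$, and the KL divergence 
$D(\sigma_G^2\|\sigma_P^2) = \frac{1}{2}\left[ \frac{\sigma_G^2}{\sigma_P^2} - 1 - \ln\left( \frac{\sigma_G^2}{\sigma_P^2} \right) \right]$. 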

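The following lower bound also holds on the average probability of bit error: 

$$\langle P_e \rangle_P \geq \sup_{\sigma_G^2 > \sigma_P^2 \mu(n):\, C(G) < R} \frac{h_b^{-1}(\delta(G))}{2} \exp\left( -n D(\sigma_G^2\|\sigma_P^2) - \frac{1}{2} \phi\left(n, h_b^{-1}(\delta(G))\right) \left( \frac{\sigma_G^2}{\sigma_P^2} - 1 \right) \right) \tag{12}$$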



where 

$$\mu(n) = \frac{1}{2} \left( 1 + \frac{1}{T(n)+1} + \frac{4T(n)+2}{n T(n) (1+T(n))} \right) \tag{13}$$
where 

$$T(n) = -W_L\left( -\exp(-1) \left(\tfrac{1}{4}\right)^{1/n} \right) \tag{14}$$

and $W_L(x)$ solves 

$$x = W_L(x) \exp(W_L(x)) \tag{15}$$

while satisfying $W_L(x) \leq -1 \ \forall x \in [-\exp(-1), 0]$, 






and 

$$\phi(n, y) = -n \left( W_L\left( -\exp(-1) \left(\tfrac{y}{2}\right)^{2/n} \right) + 1 \right). \tag{16}$$

Here $W_L(x)$ is the transcendental Lambert W function [34] that is defined implicitly by the relation (15) above. 
Proof: See Appendix II. ■ 
The expression (12) is better for plotting bounds when we expect $n$ to be moderate, while (11) is more easily 
amenable to asymptotic analysis as $n$ gets large. 
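The branch condition attached to (15) is what distinguishes $W_L$ from the principal Lambert W branch, so a small numerical sketch may help. It finds the solution $w \leq -1$ of $w e^w = x$ by bisection and then evaluates $T(n)$ from (14). The function names are hypothetical and bisection is just one convenient choice; a library routine such as `scipy.special.lambertw` with branch index `k=-1` would serve the same purpose.

```python
import math


def lambert_w_lower(x, iters=200):
    """Return the real w <= -1 with w*exp(w) = x, for -1/e <= x < 0."""
    if not (-math.exp(-1.0) <= x < 0.0):
        raise ValueError("lower branch is real only for -1/e <= x < 0")
    # On (-inf, -1], w*exp(w) decreases from 0 down to -1/e as w increases to -1.
    lo, hi = -2.0, -1.0
    while lo * math.exp(lo) < x:   # widen the bracket until f(lo) >= x
        lo *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) >= x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def T(n):
    """T(n) as in (14)."""
    return -lambert_w_lower(-math.exp(-1.0) * 0.25 ** (1.0 / n))
```

As $n$ grows, the argument in (14) approaches $-1/e$, the branch point, so `T(n)` decreases toward 1.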



B. Joint optimization of the weighted total power 

Consider the total energy spent in transmission. For transmitting $k$ bits at rate $R$, the number of channel uses is 
$m = k/R$. If each transmission has power $\xi_T P_T$, the total energy used in transmission is $\xi_T P_T m$. 

At the decoder, let the number of iterations be $l$. Assume that each node consumes $E_{node}$ joules of energy in 
each iteration. The number of computational nodes can be lower bounded by the number $m$ of received channel 
outputs, so 

$$E_{dec} \geq E_{node} \times m \times l. \tag{17}$$

This gives a lower bound of $P_D \geq E_{node}\, l$ for decoder power. There is no lower bound on the encoder complexity 
and so the encoder is considered free. This results in the following bound for the weighted total power: 

$$P_{total} \geq \xi_T P_T + \xi_D E_{node} \times l. \tag{18}$$

Using $l \geq \frac{\log_2(n)}{\log_2(\alpha)}$ as the natural lower bound on the number of iterations given a desired maximum 
neighborhood size, 

$$P_{total} \geq \xi_T P_T + \frac{\xi_D E_{node} \log_2(n)}{\log_2(\alpha)} \propto \frac{P_T}{\sigma_P^2} + \gamma \log_2(n) \tag{19}$$

where $\gamma = \frac{\xi_D E_{node}}{\sigma_P^2 \xi_T \log_2(\alpha)}$ is a constant that summarizes all the technology and environmental terms. The 
neighborhood size $n$ itself can be lower bounded by plugging the desired average probability of error into 
Theorems 1 and 2. 

It is clear from (19) that for a given rate $R$ bits per channel use, if the transmit power $P_T$ is extremely close to 
that predicted by the channel capacity, then the value of $n$ would have to be extremely large. This in turn implies 
a large number of iterations and thus high power consumption at the decoder. Therefore, 
the optimized encoder has to transmit at a power larger than that predicted by the Shannon limit in order to decrease 
the power consumed at the decoder. Also, from (7), as $\langle P_e \rangle \to 0$, the required neighborhood size $n \to \infty$. This 
implies that for any fixed value of transmit power, the power consumed at the decoder diverges to infinity as the 
probability of error converges to zero. Hence the total power consumed must diverge to infinity as the probability 
of error converges to zero. This immediately rules out the possibility of having a strongly certainty-achieving code 
using this model of iterative decoding. The price of certainty is infinite power. The only question that remains is 
whether the optimal transmitter power can remain bounded or not. 

The optimization can be performed numerically once the exchange rate $\xi_T$ is fixed, along with the technology 
parameters $E_{node}$, $\alpha$, $\xi_C$, $\xi_D$. Figures 6 and 7 show the total-power waterslide curves for iterative decoding 
assuming the lower bounds.$^{7}$ These plots show the effect of changing the relative cost of decoding. The waterslide 
curves become steeper as decoding becomes cheaper, and the plotted scale is chosen to clearly illustrate the 
double-exponential relationship between decoder power and probability of error. 

Figure 8 fixes the technology parameters and breaks out the optimizing transmit power and decoder power as 
two separate curves. It is important to note that only the weighted total power curve is a true bound on what a real 
system could achieve. The constituent $P_T$ and $P_D$ curves are merely indicators of what the qualitative behaviour 
would be if the true tradeoff behaved like the lower bound.$^{8}$ The optimal transmit power approaches a finite limit 
as the probability of error approaches 0. This limit can be calculated directly by examining (7) for the BSC and 
(11) for the AWGN case. 

7 The order-of-magnitude choice of $\gamma = 0.3$ was made using the following numbers. The energy cost of one iteration at one node 
$E_{node} \approx 1$ pJ (optimistic extrapolation from the reported values in [4], [12]), path-loss $\xi_T \approx 86$ dB corresponding to a range in the tens of 
meters, thermal noise energy per sample $\sigma_P^2 \approx 4 \times 10^{-21}$ J from $kT$ with $T$ around room temperature, and computational node connectivity 
$\alpha = 4$. 

Fig. 6. The BSC Waterslides: plots of $\log(\langle P_e \rangle)$ vs bounds on required total power for any fixed rate-1/3 code transmitted over an AWGN 
channel using BPSK modulation and hard decisions. $\gamma = \xi_D E_{node} / (\xi_T \sigma_P^2 \log_2(\alpha))$ denotes the normalized energy per node per iteration 
in SNR units. Total power takes into account the transmit power as well as the power consumed in the decoder. The Shannon limit is a 
universal lower bound for all $\gamma$. 
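The order-of-magnitude value of $\gamma$ quoted in footnote 7 can be checked with a few lines of arithmetic using the definition of $\gamma$ from the Fig. 6 caption. In this sketch the decoder weight $\xi_D = 1$ and the temperature of 290 K are assumptions (the footnote only says "around room temperature"), and the variable names are illustrative.

```python
import math

E_node = 1e-12              # energy per node per iteration, about 1 pJ (footnote 7)
path_loss_db = 86.0         # path loss in dB (footnote 7)
xi_T = 10.0 ** (path_loss_db / 10.0)  # path loss as a linear factor
sigma_P_sq = 4e-21          # thermal noise energy per sample in joules (footnote 7)
alpha = 4                   # computational node connectivity (footnote 7)
xi_D = 1.0                  # ASSUMPTION: unit weight on decoder power

# Sanity check of the noise figure: k_B * T with T = 290 K (ASSUMED temperature).
k_B = 1.380649e-23
kT = k_B * 290.0

# gamma = xi_D * E_node / (xi_T * sigma_P^2 * log2(alpha))
gamma = xi_D * E_node / (xi_T * sigma_P_sq * math.log2(alpha))
print(f"kT = {kT:.3e} J, gamma = {gamma:.3f}")
```

Under these assumptions the result lands close to the quoted 0.3, which is consistent with the footnote treating it as an order-of-magnitude choice.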

To compute this limit, recall that the goal is to optimize $\frac{P_T}{\sigma_P^2} + \gamma \log_2(n)$ over $P_T$ so as to satisfy a probability 
of error constraint $\langle P_e \rangle$, where the probability of error is tending to zero. Instead of constraining the probability 
of error to be small, it is just as valid to constrain $\gamma \log\log\frac{1}{\langle P_e \rangle}$ to be large. Now, take the logarithm of both 
sides of (7) (or similarly for (11)). It is immediately clear that the only order $n$ term is the one that multiplies 
the divergence. Since $n \to \infty$ as $\langle P_e \rangle \to 0$, this term will dominate when a second logarithm is taken. Thus, 
we know that the bound on the double logarithm of the certainty $\gamma \log\log\frac{1}{\langle P_e \rangle} \to \gamma \log_2(n) + \gamma \log f(R, \frac{P_T}{\sigma_P^2})$ 
where $f(R, \frac{P_T}{\sigma_P^2}) = D(G\|P)$ is the divergence expression involving the details of the channel. It turns out that $G$ 
approaches $C^{-1}(R)$ when $\langle P_e \rangle \to 0$ since the divergence is maximized there. 

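Optimizing for $\zeta = \frac{P_T}{\sigma_P^2}$ by taking derivatives and setting to zero gives: 

$$\frac{f(R, \zeta)}{\partial f(R, \zeta) / \partial \zeta} = \gamma. \tag{20}$$

It turns out that this has a unique root $\zeta(R, \gamma)$ for all rates $R$ and technology factors $\gamma$ for both the BSC and the 
AWGN channel. 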
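The step from the asymptotic constraint to the stationarity condition (20) can be written out as follows. This is only a sketch: it treats every logarithm as natural and suppresses the resulting base-change constant (an assumption made here for readability), and the symbol $L$ for the target value of the double logarithm is introduced purely for illustration.

```latex
\begin{aligned}
&\text{minimize over } \zeta: && \zeta + \gamma \ln n \\
&\text{subject to} && \ln n + \ln f(R,\zeta) = L
   \quad \left(L \text{ is the target value of } \ln\ln\tfrac{1}{\langle P_e \rangle}\right) \\
&\text{substituting } \ln n: && J(\zeta) = \zeta + \gamma L - \gamma \ln f(R,\zeta) \\
&\text{stationarity:} && \frac{dJ}{d\zeta}
   = 1 - \gamma\, \frac{\partial f(R,\zeta)/\partial \zeta}{f(R,\zeta)} = 0
   \;\Longleftrightarrow\;
   \frac{f(R,\zeta)}{\partial f(R,\zeta)/\partial \zeta} = \gamma .
\end{aligned}
```

Because $L$ enters $J(\zeta)$ only as an additive constant, it disappears on differentiation, so the optimizing $\zeta$ does not depend on the target probability of error in this limit.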

The key difference between (5) and (20) is that no term that is related to the neighborhood size or number of 
iterations has survived in (20). This is a consequence of the double-exponential reduction in the probability of 
error with the number of iterations and the fact that the transmit power shows up in the outer and not the inner 
exponential.$^{9}$ 

8 This doesn't mean that the bound is useless however. A lower bound on the transmit power can be computed once any implementable 
scheme exists. Simply look up where the bounded total power matches the implementable scheme. This will immediately give rise to lower 
bounds on the optimal transmit and decoding powers. 

9 In fact, it is easy to verify that anything faster than double-exponential will also work. 

Fig. 7. The AWGN Waterslides: plots of $\log(\langle P_e \rangle)$ vs bounds on required total power for any fixed rate-1/3 code transmitted over an AWGN 
channel. The initial segment where all the waterslide curves almost coincide illustrates the looseness of the bound since that corresponds to 
the case of $n = 1$ or when the bound suggests that uncoded transmission could be optimal. However, the probability of error is too optimistic 
for uncoded transmission. 

To see if iterative decoding allows weakly capacity-achieving codes, we take the limit of $\xi_D \to 0$, which implies 
$\gamma \to 0$. (20) then suggests that we need to solve $\frac{f(R, \zeta)}{\partial f(R, \zeta)/\partial \zeta} = 0$, which implies that either the numerator is 
zero or the denominator becomes infinite. For AWGN or BSC channels, the slope of the error exponent $f(R, \zeta)$ is 
monotonically decreasing as the SNR $\zeta \to \infty$ and so the unique solution is where $f(R, \zeta) = D(C^{-1}(R)\|P_T) = 0$. 
This occurs at $P_T = C^{-1}(R)$ and so the lower bounds of this section do not rule out weakly capacity-achieving 
codes. 

In the other direction, as the $\gamma$ term gets large, the optimizing $P_T(R, \gamma)$ increases. This matches the intuition that 
as the relative cost of decoding increases, more power should be allocated to the transmitter. This effect is plotted in 
Figure 9. Notice that it becomes insignificant when $\gamma$ is very small (long-range communication) but becomes 
non-negligible whenever $\gamma$ exceeds 0.1. 

Figure 10 illustrates how the effect varies with the desired rate $R$. The penalty for using low-rate codes is quite 
significant and this gives further support to the lessons drawn from [7], [9] with some additional intuition regarding 
why it is fundamental. The error exponent governing the probability of error as a function of the neighborhood 
size is limited by the sphere-packing bound at rate 0. This is finite and the only way to increase it is to pay more 
transmit power. However, the decoding power is proportional to the number of received samples and this is larger 
at lower rates. 

Finally, the plots were all made assuming that the neighborhood size $n$ could be chosen arbitrarily and the 
number of iterations could be a real number rather than being restricted to integer values. This is fine when the 
desired probability of error is low, but it turns out that this integer effect cannot be neglected when the tolerable 
probability of error is high. This is particularly significant when $\gamma$ is large. To see this, it is useful to consider 
the boundary between when uncoded transmission is optimal and when coding might be competitive. This is done 
in Figure 11, where the minimum $\gamma \log_2(\alpha)$ power required for the first decoder iteration is instead given to the 
transmitter. Once $\gamma > 10$, it is hard to beat uncoded transmission unless the desired probability of error is very low 
indeed. 

Fig. 8. The BSC Waterslide curve for $\gamma = 0.3$, $R = 1/3$. An upper bound (from Section IV-C) that is parallel to the lower bound is also 
shown along with the heuristically optimal transmit power. This transmit power is larger than that predicted by the Shannon limit for small 
probabilities of error. This suggests that the transmitter has to make accommodations for the decoder complexity in order to minimize the 
total power consumption. 

C. Upper bounds on complexity 

It is unclear how tight the lower bounds given earlier in this section are. The most shocking aspect of the lower 
bounds is that they predict a double exponential improvement in probability of error with the number of iterations. 
This is what is leading to the potential for weakly capacity-achieving codes. To see the order-optimality of the 
bound in principle, we will "cheat" and exploit the fact that our model for iterative decoding in Section III does 
not limit either the size of the messages or the computational power of each node in the decoder. This allows us 
to give upper bounds on the number of iterations required for a given performance. 

Theorem 3: There exists a code of rate $R < C$ such that the required neighborhood size to achieve $\langle P_e \rangle$ average 
probability of error is upper bounded by 

$$n \leq \frac{\ln \frac{1}{\langle P_e \rangle}}{E_r(R)} \tag{21}$$

where $E_r(R)$ is the random-coding error exponent for the channel [13]. The required number of iterations to achieve 
this neighborhood size is bounded above by 

$$l \leq 2\, \frac{\log_2(n)}{\log_2(\alpha)} + 2. \tag{22}$$

Proof: This "code" is basically an abuse of the definitions. We simply use a rate-$R$ random code of length $n$ 
from [13] where each code symbol is drawn iid. Such random codes, if decoded using ML decoding, satisfy 

$$\langle P_e \rangle_P \leq \langle P_e \rangle_{block} \leq \exp(-n E_r(R)). \tag{23}$$

The decoder for each bit needs at most $n$ channel-output symbols to decode the block (and hence any particular 
bit). 

Now it is enough to show an upper bound on the number of iterations, $l$. Consider a regular tree structure imposed 
on the code with a branching factor of $\alpha$ and thus overall degree $\alpha + 1$. Since the tree would have $\alpha^d$ nodes in it at 
depth $d$, a required depth of $d = \frac{\log_2(n)}{\log_2(\alpha)} + 1$ is sufficient to guarantee that everything within a block is connected. 

Designate some subset of computational nodes as responsible for decoding the individual message bits. At each 
iteration, the "message" transmitted by a node is just the complete list of its own observation plus all the messages 
that node has received so far. Because the diameter of a tree is no more than twice its depth, at the end of $2d$ 
iterations, all the nodes will have received all the values of received symbols in the neighborhood. They can then 
each ML decode the whole block, with average error probability given by (23). The result follows. ■ 

Fig. 9. The impact of $\gamma$ on the heuristically predicted optimum transmit power for the BSC used at $R = 1/3$. The plot shows the gap from 
the Shannon prediction in a factor sense. 
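The counting in the proof of Theorem 3 can be made concrete with a short sketch. The function names and the numerical values in the usage lines (a target error probability of $10^{-6}$, an error exponent of 0.05 nats, and $\alpha = 4$) are hypothetical and chosen only to exercise the formulas (21) and (22).

```python
import math


def neighborhood_upper_bound(pe, Er):
    """Smallest integer n with exp(-n * Er) <= pe (Er in nats), as in (21)."""
    return math.ceil(math.log(1.0 / pe) / Er)


def tree_depth(n, alpha):
    """Smallest integer depth d such that alpha**d >= n."""
    d = 0
    while alpha ** d < n:
        d += 1
    return d


def iterations_upper_bound(n, alpha):
    """Right-hand side of (22): 2 * log2(n) / log2(alpha) + 2."""
    return 2.0 * math.log2(n) / math.log2(alpha) + 2.0


# Hypothetical numbers, for illustration only.
n = neighborhood_upper_bound(1e-6, 0.05)
d = tree_depth(n, 4)
print(n, d, iterations_upper_bound(n, 4))
```

Flooding all observations across a tree of depth `d` takes at most `2 * d` iterations, and since `d` is at most $\log_2(n)/\log_2(\alpha) + 1$, this never exceeds the value returned by `iterations_upper_bound`.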

For both the AWGN channel and BSC, this bound recovers the basic behavior that is needed to have the 
probability of error drop doubly-exponentially in the number of iterations. For the BSC, it is also clear that since 
$E_r(R) = D(C^{-1}(R)\|p)$ for rates $R$ in the neighborhood of capacity, the upper and lower bounds essentially agree 
on the asymptotic neighborhood size when $\langle P_e \rangle \to 0$. The only difference comes in the number of iterations. This 
is at most a factor of 2 and so has the same effect as a slightly different $\xi_D$ in terms of the shape of the curves 
and optimizing transmit power. 

We note here that this upper bound points to the fact that the decoding model of Section III is too powerful 
rather than being overly constraining. It allows free computations at each node and unboundedly large messages. 
This suggests that the lower bounds are relevant, but it is unclear whether they are actually attainable with any 
implementable code. We delve further into this in Section VI. 

V. The gap to capacity and related work 

Looking back at our bounds of Theorems 1 and 2, they seem to suggest that a certain minimum number 
($\log_\alpha f(R, P_T)$) of iterations are required and after that, the probability of error can drop doubly exponentially 
with additional iterations. This parallels the result of [28, Theorem 5] for regular LDPCs that essentially implies 
that regular LDPCs can be considered weakly certainty-achieving codes. However, our bounds above indicate that 
iterative decoding might be compatible with weakly capacity-achieving codes as well. Thus, it is interesting to ask 
how the complexity behaves if we operate very close to capacity. Following tradition, denote the difference between 
the channel capacity $C(P)$ and the rate $R$ as the $gap = C(P) - R$. 

Fig. 10. The impact of rate $R$ on the heuristically predicted optimum transmit power for $\gamma = 0.3$. The plot shows the Shannon minimum 
power, our predictions, and the ratio of the difference between the two to the Shannon minimum. Notice that the predicted extra power is 
very substantial at low data rates. 

Since our bounds are general, it is interesting to compare them with the existing specialized bounds in the vicinity 
of capacity. After first reviewing a trivial bound in Section V-A to establish a baseline, we review some key results 
in the literature in Section V-B. Before we can give our results, we take another look at the waterfall curve in 
Figure 1 and notice that there are a number of ways to approach the Shannon limit. We discuss our approach in 
Section V-C before giving our lower bounds to the number of iterations in Section V-D. 

A. The trivial bound for the BSC 

Given a crossover probability $p$, it is important to note that there exists a semi-trivial bound on the neighborhood 
size that only depends on $\langle P_e \rangle$. Since there is at least one configuration of the neighborhood that will decode 
to an incorrect value for this bit, it is clear that 

$$\langle P_e \rangle \geq p^n. \tag{24}$$

This implies that the number of computational iterations for a code with maximum decoding degree $\alpha + 1$ is lower 
bounded by $\frac{\log\log\frac{1}{\langle P_e \rangle} - \log\log\frac{1}{p}}{\log \alpha}$. This trivial bound does not have any dependence on the capacity and so does not 
capture the fact that the complexity should increase inversely as a function of gap as well. 
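A short numerical sketch of the trivial bound shows how slowly it grows. The function names and the example values ($\langle P_e \rangle = 10^{-12}$, $p = 0.1$, $\alpha = 3$) are hypothetical, and the conversion from neighborhood size to iterations uses the $\log_\alpha$ growth of a neighborhood with the number of iterations.

```python
import math


def trivial_neighborhood_bound(pe, p):
    """Smallest real n allowed by (24), i.e. the n solving p**n = pe."""
    return math.log(1.0 / pe) / math.log(1.0 / p)


def trivial_iteration_bound(pe, p, alpha):
    """(log log(1/pe) - log log(1/p)) / log(alpha), the bound on iterations."""
    return (math.log(math.log(1.0 / pe))
            - math.log(math.log(1.0 / p))) / math.log(alpha)


n_min = trivial_neighborhood_bound(1e-12, 0.1)
l_min = trivial_iteration_bound(1e-12, 0.1, 3)
print(n_min, l_min)
```

Even at this very small target error probability, the bound asks for only a handful of iterations, and it is the same for every rate below capacity.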

B. Prior work 

There is a large literature relating to codes that are specified by sparse graphs. The asymptotic behavior as these 
codes attempt to approach Shannon capacity is a central question in that literature. For regular LDPC codes, a result 
in Gallager's Ph.D. thesis [26, Pg. 40] shows that the average degree of the graph (and hence the average number 
of operations per iteration) must diverge to infinity in order for these codes to approach capacity even under ML 
decoding. It turns out that it is not hard to specialize our Theorem 1 to regular LDPC codes and have it become 
tighter along the way. Such a modified bound would show that as the gap from Gallager's rate bound converges to 
zero, the number of iterations must diverge to infinity. However, it would permit double-exponential improvements 
in the probability of error as the number of iterations increased. 

Fig. 11. The probability of error below which coding could potentially be useful. This plot assumes an AWGN channel used with BPSK 
signaling and hard-decision detection, target message rate $R = 1/3$, and an underlying iterative-decoding architecture with $\alpha = 3$. This plot 
shows what probability of error would be achieved by uncoded transmission (repetition coding) if the transmitter is given extra power beyond 
that predicted by Shannon capacity. This extra power corresponds to that required to run one iteration of the decoder. Once $\gamma$ gets large, 
there is effectively no point in doing coding. 

More recently, in [35, Pg. 69] and [36], Khandekar and McEliece conjectured that for all sparse-graph codes, 
the number of iterations must scale either multiplicatively as 

$$\Omega\left( \frac{1}{gap} \log_2\left( \frac{1}{\langle P_e \rangle} \right) \right) \tag{25}$$

or additively as 

$$\Omega\left( \frac{1}{gap} + \log_2\left( \frac{1}{\langle P_e \rangle} \right) \right) \tag{26}$$

in the near neighborhood of capacity. Here we use the $\Omega$ notation to denote lower bounds in the order sense of 
[37]. This conjecture is based on a graphical argument for the message-passing decoding of sparse-graph codes 
over the BEC. The intuition was that the bound should also hold for general memoryless channels, since the BEC 
is the channel with the simplest decoding. 

Recently, the authors in [38] were able to formalize and prove a part of the Khandekar-McEliece conjecture 
for three important families of sparse-graph codes, namely the LDPC codes, the Accumulate-Repeat-Accumulate 
(ARA) codes, and the Irregular-Repeat Accumulate (IRA) codes. Using some remarkably simple bounds, the authors 
demonstrate that the number of iterations usually scales as $\Omega\left(\frac{1}{gap}\right)$ for Binary Erasure Channels (BECs). If, however, 
the fraction of degree-2 nodes for these codes converges to zero, then the bounds in [38] become trivial. The authors 
note that all the known traditionally capacity-achieving sequences of these code families have a non-zero fraction 
of degree-2 nodes. 

In addition, the bounds in [38] do not imply that the number of decoding iterations must go to infinity as 
$\langle P_e \rangle \to 0$. So the conjecture is not yet fully resolved. We can observe however that both of the conjectured bounds 
on the number of decoding iterations have only a singly-exponential dependence of the probability of error on 
the number of iterations. The multiplicative bound (25) behaves like a block or convolutional code with an 
error exponent of $K \times gap$ and so, by the arguments of Section II-B.3, is not compatible with such codes being weakly 
capacity achieving in our sense. However, it turns out that the additive bound (26) is compatible with being weakly 
capacity achieving. This is because the main role of the double-exponential in our derivation is to allow a second 
logarithm to be taken that decoupled the term depending on the transmit power from the one that depends on the 
probability of error. The conjectured additive bound (26) has that form already. 



C. 'Gap' to capacity 

In the vicinity of capacity, the complication is that for any finite probability of bit error, it is in principle possible 
to communicate at rates above the channel capacity. Before transmission, the $k$ bits could be lossily compressed 
using a source code to $\approx (1 - h_b(\langle P_e \rangle))k$ bits. The channel code could then be used to protect these bits, and the 
resulting codeword transmitted over the channel. After decoding the channel code, the receiver could in principle 
use the source decoder to recover the message bits with an acceptable average probability of bit error. Therefore, 
for fixed $\langle P_e \rangle$, the maximum achievable rate is $\frac{C}{1 - h_b(\langle P_e \rangle)}$. 

Consequently, the appropriate total gap is $\frac{C}{1 - h_b(\langle P_e \rangle)} - R$, which can be broken down as a sum of two 'gaps': 

$$\frac{C}{1 - h_b(\langle P_e \rangle)} - R = \left\{ \frac{C}{1 - h_b(\langle P_e \rangle)} - C \right\} + \{ C - R \}. \tag{27}$$

The first term goes to zero as $\langle P_e \rangle \to 0$ and the second term is the intuitive idea of gap to capacity. 
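The decomposition (27) is easy to tabulate. In the sketch below the function names are illustrative, and the BSC crossover probability 0.05 and rate 0.6 in the usage lines are hypothetical values chosen only to show the first term vanishing as the tolerated bit-error probability shrinks.

```python
import math


def hb(x):
    """Binary entropy function in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def max_rate(C, pe):
    """Maximum achievable rate C / (1 - h_b(pe)) when bit-error rate pe is tolerated."""
    return C / (1.0 - hb(pe))


def gap_terms(C, R, pe):
    """The two braces of (27): (C/(1-h_b(pe)) - C, C - R)."""
    return max_rate(C, pe) - C, C - R


C = 1.0 - hb(0.05)   # capacity of a BSC with crossover 0.05 (hypothetical channel)
for pe in (1e-2, 1e-4, 1e-6):
    first, second = gap_terms(C, 0.6, pe)
    print(pe, first, second)
```

The second term stays fixed while the first shrinks with `pe`, which is why a joint path with both $\langle P_e \rangle \to 0$ and $R \to C$ is needed to isolate the intuitive gap.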

The traditional approach of error exponents is to study the behavior as the gap is fixed and $\langle P_e \rangle \to 0$. Considering 
the error exponent as a function of the gap reveals something about how difficult it is to approach capacity. However, 
as we have seen in the previous section, our bounds predict double-exponential improvements in the probability 
of error with the number of iterations. In that way, our bounds share a qualitative feature with the trivial bound of 
Section V-A. 

It turns out that the bounds of Theorems 1 and 2 do not give very interesting results if we fix $\langle P_e \rangle > 0$ and 
let $R \to C$. We need $\langle P_e \rangle \to 0$ alongside $R \to C$. To capture the intuitive idea of gap, which is just the second 
term in (27), we want to be able to assume that the effect of the second term dominates the first. This way, we 
can argue that the decoding complexity increases to infinity as $gap \to 0$ and not just because $\langle P_e \rangle \to 0$. For this, 
it suffices to consider $\langle P_e \rangle = gap^\beta$ for $\beta > 1$. Our proof actually gives a result for $\langle P_e \rangle = gap^\beta$ for any $\beta > 0$. 



D. Lower bound on iterations for regular decoding in the vicinity of capacity 

Theorems 1 and 2 can be expanded asymptotically in the vicinity of capacity to see the order scaling of the 
required neighborhood size with the gap to capacity. Essentially, this shows that the neighborhood size must grow 
at least proportional to $\frac{1}{gap^2}$ unless the average probability of bit error is dropping so slowly with gap that the 
dominant gap is actually the $\left\{ \frac{C}{1 - h_b(\langle P_e \rangle)} - C \right\}$ term in (27). 

Theorem 4: For the problem as stated in Section III, we obtain the following lower bounds on the required 
neighborhood size $n$ for $\langle P_e \rangle = gap^\beta$ and $gap \to 0$. 

For the BSC, 

• For $\beta < 1$, $n = \Omega\left( \frac{\log_2(1/gap)}{gap^{2\beta}} \right)$. 

• For $\beta \geq 1$, $n = \Omega\left( \frac{\log_2(1/gap)}{gap^{2}} \right)$. 

For the AWGN channel, 

• For $\beta < 1$, $n = \Omega\left( \frac{1}{gap^{2\beta}} \right)$. 

• For $\beta \geq 1$, $n = \Omega\left( \frac{1}{gap^{2}} \right)$. 

Proof: We give the proof here in the case of the BSC with some details relegated to the Appendix. The 
AWGN case follows analogously, with some small modifications that are detailed in Appendix IV. 

Let the code for the given BSC $P$ have rate $R$. Consider BSC channels $G$, chosen so that $C(G) < R < C(P)$, 
where $C(\cdot)$ maps a BSC to its capacity in bits per channel use. Taking $\log_2(\cdot)$ on both sides of (7) (for a fixed $g$), 

$$\log_2(\langle P_e \rangle_P) \geq \log_2\left( h_b^{-1}(\delta(G)) \right) - 1 - n D(g\|p) - \epsilon \sqrt{n} \log_2\left( \frac{g(1-p)}{p(1-g)} \right). \tag{28}$$

Rewriting, 

$$n D(g\|p) + \epsilon \sqrt{n} \log_2\left( \frac{g(1-p)}{p(1-g)} \right) + \log_2(\langle P_e \rangle_P) - \log_2\left( h_b^{-1}(\delta(G)) \right) + 1 \geq 0. \tag{29}$$

This expression is quadratic in $\sqrt{n}$. The LHS potentially has two roots. If both the roots are not real, then the 
expression is always positive, and we get a trivial lower bound of $\sqrt{n} \geq 0$. Therefore, the cases of interest are 
when the two roots are real. The larger of the two roots is a lower bound on $\sqrt{n}$. 

Denoting the coefficient of $n$ by $a = D(g\|p)$, that of $\sqrt{n}$ by $b = \epsilon \log_2\left( \frac{g(1-p)}{p(1-g)} \right)$, and the constant terms by 
$c = \log_2(\langle P_e \rangle_P) - \log_2\left( h_b^{-1}(\delta(G)) \right) + 1$ in (29), the quadratic formula then reveals 

$$\sqrt{n} \geq \frac{-b + \sqrt{b^2 - 4ac}}{2a}. \tag{30}$$

Since the lower bound holds for all $g$ satisfying $C(G) < R = C - gap$, we substitute $g^* = p + gap^r$, for some 
$r < 1$ and small gap. This choice is motivated by examining Figure 12. The constraint $r < 1$ is imposed because 
it ensures $C(g^*) < R$ for small enough gap. 
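The case analysis around (30) (complex roots, or a negative larger root, both give only the trivial bound) can be captured in a few lines. This is a minimal sketch with an illustrative function name; the coefficients `a`, `b`, `c` are those defined just above, with `a > 0` assumed.

```python
import math


def sqrt_n_lower_bound(a, b, c):
    """Lower bound on sqrt(n) implied by a*n + b*sqrt(n) + c >= 0 with a > 0.

    Returns the larger root of a*x**2 + b*x + c when it is real and positive,
    and the trivial bound 0 otherwise.
    """
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return 0.0  # no real roots: the quadratic is positive everywhere
    return max(0.0, (-b + math.sqrt(disc)) / (2.0 * a))


def n_lower_bound(a, b, c):
    """Corresponding lower bound on the neighborhood size n."""
    return sqrt_n_lower_bound(a, b, c) ** 2
```

Since `b` is positive for $g > p$, the bound is non-trivial exactly when `c` is negative, which is the condition examined later in the proof.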

Lemma 1: In the limit of $gap \to 0$, for $g^* = p + gap^r$ to satisfy $C(g^*) < R$, it suffices that $r$ be less than 1. 
Proof: 

$$\begin{aligned} C(g^*) &= C(p + gap^r) \\ &= C(p) + gap^r \times C'(p) + o(gap^r) \\ &\leq C(p) - gap = R, \end{aligned}$$

for small enough gap and $r < 1$. The final inequality holds since $C(p)$ is a monotonically decreasing convex-$\cup$ 
function for a BSC with $p < \frac{1}{2}$, so that $C'(p) < 0$, whereas $gap^r$ increases faster than any linear function of gap 
when gap is small enough. ■ 
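A quick numerical check illustrates why the restriction $r < 1$ matters in Lemma 1. The function names and the test point $p = 0.4$ (where the slope of the capacity has magnitude below 1) are choices made for this sketch.

```python
import math


def hb(x):
    """Binary entropy function in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def capacity(p):
    """BSC capacity C(p) = 1 - h_b(p) in bits per channel use."""
    return 1.0 - hb(p)


def lemma1_condition(p, gap, r):
    """True when g* = p + gap**r satisfies C(g*) < R = C(p) - gap."""
    return capacity(p + gap ** r) < capacity(p) - gap


print(lemma1_condition(0.4, 1e-4, 0.5))  # r < 1
print(lemma1_condition(0.4, 1e-4, 1.0))  # r = 1
```

With $r = 1$ the perturbation $gap^r$ is only linear in gap, so whether the condition holds depends on the local slope $|C'(p)|$, while any $r < 1$ eventually works for small enough gap.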



In steps, we now Taylor-expand the terms on the LHS of (29) about $g = p$. 

Lemma 2 (Bounds on $h_b$ and $h_b^{-1}$ from [39]): For all $d > 1$, and for all $x \in [0, \frac{1}{2}]$ and $y \in [0, 1]$, 

$$h_b(x) \geq 2x \tag{31}$$
$$h_b(x) \leq 2 x^{1 - 1/d}\, d / \ln(2) \tag{32}$$
$$h_b^{-1}(y) \geq y^{\frac{d}{d-1}} \left( \frac{\ln(2)}{2d} \right)^{\frac{d}{d-1}} \tag{33}$$
$$h_b^{-1}(y) \leq \frac{1}{2} y. \tag{34}$$

Proof: See Appendix III-A. ■ 

Lemma 3: 

$$\frac{d}{d-1} r \log_2(gap) - 1 + K_1 + o(1) \leq \log_2\left( h_b^{-1}(\delta(g^*)) \right) \leq r \log_2(gap) - 1 + K_2 + o(1) \tag{35}$$

where $d > 1$ is arbitrary, $K_1$ is a constant depending only on $p$ and $d$, and $K_2 = \log_2\left( \frac{h_b'(p)}{C(p)} \right)$. 
Proof: See Appendix III-B. ■ 

Lemma 4: 

$$D(g^*\|p) = \frac{gap^{2r}}{2p(1-p)\ln(2)} (1 + o(1)). \tag{36}$$

Proof: See Appendix III-C. ■ 

Fig. 12. The behavior of $g^*$, the optimizing value of $g$ for the bound in Theorem 1, with gap. We plot $\log(g^* - p)$ vs $\log(gap)$. The 
resulting straight lines inspired the substitution of $g^* = p + gap^r$. 



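Lemma 5: 

$$\log_2\left( \frac{g^*(1-p)}{p(1-g^*)} \right) = \frac{gap^r}{p(1-p)\ln(2)} (1 + o(1)).$$

Proof: See Appendix III-D. ■ 

Lemma 6: 

$$\sqrt{\frac{r}{K(p)}} \sqrt{\log_2\left( \frac{1}{gap} \right)} (1 + o(1)) \leq \epsilon \leq \sqrt{\frac{rd}{(d-1)K(p)}} \sqrt{\log_2\left( \frac{1}{gap} \right)} (1 + o(1))$$

where $K(p)$ is from (10). 

Proof: See Appendix III-E. ■ 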
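The leading-order expansions in Lemmas 4 and 5 can be sanity-checked numerically by forming the ratio of each exact quantity to its claimed approximation at $g^* = p + gap^r$. The function names and the test values below are choices made for this sketch; ratios close to 1 for small gap are what the $(1 + o(1))$ factors predict.

```python
import math


def kl(g, p):
    """Binary KL divergence D(g||p) in bits, for 0 < g, p < 1."""
    return (g * math.log2(g / p)
            + (1.0 - g) * math.log2((1.0 - g) / (1.0 - p)))


def lemma4_ratio(p, gap, r):
    """D(g*||p) divided by gap**(2r) / (2 p (1-p) ln 2)."""
    g = p + gap ** r
    approx = gap ** (2.0 * r) / (2.0 * p * (1.0 - p) * math.log(2.0))
    return kl(g, p) / approx


def lemma5_ratio(p, gap, r):
    """log2(g*(1-p) / (p(1-g*))) divided by gap**r / (p (1-p) ln 2)."""
    g = p + gap ** r
    exact = math.log2(g * (1.0 - p) / (p * (1.0 - g)))
    approx = gap ** r / (p * (1.0 - p) * math.log(2.0))
    return exact / approx


for gap in (1e-4, 1e-6, 1e-8):
    print(gap, lemma4_ratio(0.1, gap, 0.5), lemma5_ratio(0.1, gap, 0.5))
```

Both ratios drift toward 1 as gap decreases, with a deviation on the order of $gap^r$.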
If $c < 0$, then the bound (30) is guaranteed to be positive. For $\langle P_e \rangle_P = gap^\beta$, the condition $c < 0$ is equivalent 
to 

$$\beta \log_2(gap) - \log_2\left( h_b^{-1}(\delta(g^*)) \right) + 1 < 0. \tag{37}$$

Since we want (37) to be satisfied for all small enough values of gap, we can use the approximations in Lemmas 3-6 
and ignore constants to immediately arrive at the following sufficient condition: 

$$\beta \log_2(gap) - \frac{d}{d-1} r \log_2(gap) < 0, \quad \text{i.e.} \quad r < \frac{\beta (d-1)}{d},$$

where $d$ can be made arbitrarily large. Now, using the approximations in Lemma 3 and Lemma 5 and substituting 
them into (30), we can evaluate the solution of the quadratic equation. 

As shown in Appendix III-F, this gives us the following lower bound on $n$: 

$$n \geq \Omega\left( \frac{\log_2(1/gap)}{gap^{2r}} \right) \tag{38}$$

for any $r < \min\{\beta, 1\}$. Theorem 4 follows. ■ 

The lower bound on neighborhood size $n$ can immediately be converted into a lower bound on the minimum 
number of computational iterations by just taking $\log_\alpha(\cdot)$. Note that this is not a comment about the degree of a 
potential sparse graph that defines the code. This is just about the maximum degree of the decoder's computational 
nodes and is a bound on the number of computational iterations required to hit the desired average probability of 
error. 

It turns out to be easy to show that the upper bound of Theorem 3 gives rise to the same scaling on the 
neighborhood size. This is because the random-coding error exponent in the vicinity of the capacity agrees with 
the sphere-packing error exponent which just has the quadratic term coming from the KL divergence. However, 
when we translate it from neighborhoods to iterations, the two bounds asymptotically differ by a factor of 2 that 
comes from (22). 

The lower bounds are plotted in Figure 13 for various different values of $\beta$ and reveal a $\log\frac{1}{gap}$ scaling to the 
required number of iterations when the decoder has bounded degree for message passing. This is much larger 
than the trivial lower bound of $\log\log\frac{1}{gap}$ but is much smaller than the Khandekar-McEliece conjectured $\frac{1}{gap}$ or 
$\frac{1}{gap}\log_2\left(\frac{1}{gap}\right)$ scaling for the number of iterations required to traverse such paths toward certainty at capacity. 

Fig. 13. Lower bounds for neighborhood size vs the gap to capacity for $\langle P_e \rangle = gap^\beta$ for various values of $\beta$ (horizontal axis: 
$\log_{10}(gap)$). The curve titled "balanced" gaps shows the behavior for $\frac{C}{1 - h_b(\langle P_e \rangle)} - C = C - R$, that is, the two 'gaps' are 
equal. The curves are plotted by brute-force optimization of (7), but reveal slopes that are as predicted in Theorem 4. 



VI. Conclusions and future work 

In this work, we use the inherently local nature of message-passing decoding algorithms to derive lower bounds 
on the number of iterations. It is interesting to note that with so few assumptions on the decoding algorithm and 
the code structure, the number of iterations still diverges to infinity as $gap \to 0$. As compared to [40] where a 
similar approach is adopted, the bounds here are stronger, and indeed tight in an order sense for the decoding model 
considered. To show the tightness (in order) of these bounds, we derived corresponding upper bounds that behave 
similarly to the lower bounds, but these exploit a loophole in our complexity model. Our model only considers the 
limitations induced by the internal communication structure of the decoder; it does not restrict the computational 
power of the nodes within the decoder. Even so, there is still a significant gap between our upper and lower bounds 
in terms of the constants and we suspect this is largely related to the known looseness of the sphere-packing bound 
[41], as well as our coarse bounding of the required graph diameter. Our model also does not address the power 
requirements of encoding. 

Because we assume little about the code structure, the bounds here are much more optimistic than those in [38]. 
However, it is unclear to what extent the optimism of our bound is an artifact. After all, [28] does get double- 
exponential reductions in probability of error with additional iterations, but for a family of codes that does not 
seem to approach capacity. This suggests that an investigation into expander codes might help resolve this question 
since expander codes can approach capacity, be decoded using a circuit of logarithmic depth (like our iterations), 
and achieve error exponents with respect to the overall block length [42]. It may very well be that expanders or 
expander-like codes can be shown to be weakly capacity achieving in our sense. 

For any kind of capacity-achieving code, we conjecture that the optimizing transmit power will be the sum of 
three terms 

$$P_T^* = C^{-1}(R) + \mathrm{Tech}(\vec{\xi}, \alpha, E_{node}, R) \pm A(\langle P_e \rangle, R, \vec{\xi}, \alpha, E_{node}).$$

• $C^{-1}(R)$ is the prediction from Shannon's capacity. 

• $\mathrm{Tech}(\vec{\xi}, \alpha, E_{node}, R)$ is the minimum extra transmit power that needs to be used asymptotically to help reduce 
the difficulty of encoding and decoding for the given application and implementation technology. Solving (20) 
and subtracting $C^{-1}(R)$ gives a heuristic target value to aim for, but it remains an open problem to get a tight 
estimate for this term. 

• $A(\langle P_e \rangle, R, \vec{\xi}, \alpha, E_{node})$ is an amount by which we should increase or reduce the transmit power because we 
are willing to tolerate some finite probability of error and the non-asymptotic behavior is still significant. This 
term should go to zero as $\langle P_e \rangle \to 0$. 

Understanding the second term $\mathrm{Tech}(\vec{\xi}, \alpha, E_{node}, R)$ above is what is needed to give principled answers regarding 
how close to capacity the transmitter should operate. 

The results here indicate that strongly capacity-achieving coding systems are not possible if we use the given 
model of iterative decoding. There are a few possibilities worth exploring. 

1) Our model of iterative decoding left out some real-world computational capability that could be exploited to 
dramatically reduce the required power consumption. There are three natural candidates here. 

• Selective and adaptive sleep: In the current model, all computational nodes are actively consuming power 
for all the iterations. If there was a way for computational nodes to adaptively turn themselves off and 
use no power while sleeping, then the results might change. We suspect that bounding the performance 
of such systems will require some sort of neighborhood-oriented analogies to the bounds for 
variable-block-length coding [43], [44]. 

• Dynamically reconfigurable circuits: In the current model, the connectivity structure of computational 
nodes is fixed and considered as unchanging wiring. If there was a way for computational nodes to 
dynamically rewire who their neighbors are (for example by moving themselves in the combined spirit 
of [12], [45], [46]), this might change the results. 

• Feedback: In [16], a general scheme is presented that achieves an infinite computational error exponent 
by exploiting noiseless channel-output feedback as well as an infinite amount of common randomness. 
If such a scheme could be implemented, it would presumably be strongly capacity achieving as both the 
transmission and processing power could remain finite while having arbitrarily low average probability 
of bit error. However, we are unaware if either this scheme or any of the encoding strategies that claim 
to deliver "linear-time" encoding and decoding with an error exponent (e.g. [42], [47]) are actually 
implementable in a way that uses finite total power. 

2) Strongly or even weakly capacity-achieving communication systems may be possible using infallible 
computational entities but may be impossible to achieve using unreliable computational nodes that must burn more 
power (i.e., raise the voltages) to be more reliable [48]. 

3) Either strongly or weakly capacity-achieving communication systems might be impossible on thermodynamic 
grounds. Decoding in some abstract sense is related to the idea of cooling a part of a system [49]. Since an 
implementation can be considered a collection of Maxwell Daemons, this might be useful to rule out certain 
models of computation as being aphysical. 
Finally, the approach here should be interesting if extended to a multiuser context where the prospect of causing 
interference makes it less easy to improve reliability by just increasing the transmit power. There, it might give 
some interesting answers as to what kind of computational efficiency is needed to make it asymptotically worth 
using multiterminal coding theory. 



Appendix I 

Proof of Theorem 1: lower bound on $\langle P_e \rangle_P$ for the BSC 

The idea of the proof is to first show that the average probability of error for any code must be significant if 
the channel were a much worse BSC. Then, a mapping is given that maps the probability of an individual error 
event under the worse channel to a lower bound on its probability under the true channel. This mapping is shown 
to be convex-∪ in the probability of error, and this allows us to use the same mapping to get a lower bound on the 
average probability of error under the true channel. We proceed in steps, with the lemmas proved after the main 
argument is complete. 

Proof: Suppose we ran the given encoder and decoder over a test channel $G$ instead. 

Lemma 7 (Lower bound on $\langle P_e \rangle$ under test channel $G$): If a rate-$R$ code is used over a channel $G$ with 
$C(G) < R$, then the average probability of bit error satisfies 

$$\langle P_e \rangle_G \ge h_b^{-1}(\delta(G)) \tag{39}$$

where $\delta(G) = 1 - \frac{C(G)}{R}$. This holds for any channel model $G$, not just BSCs. 

Proof: See Appendix I-A. ■ 
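
As an illustration only, the bound of Lemma 7 can be evaluated numerically. The sketch below is not part of the original analysis: it assumes a BSC test channel, inverts the binary entropy function by bisection on $[0, \frac{1}{2}]$, and uses hypothetical helper names (`binary_entropy`, `inverse_binary_entropy`, `lemma7_bound`) and hypothetical example values (rate 1/2, test crossover probability 0.2).

```python
import math


def binary_entropy(x):
    """Binary entropy h_b(x) in bits, with h_b(0) = h_b(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def inverse_binary_entropy(y, tol=1e-12):
    """Return the x in [0, 1/2] with h_b(x) = y, located by bisection."""
    if not 0.0 <= y <= 1.0:
        raise ValueError("y must lie in [0, 1]")
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lemma7_bound(rate, test_capacity):
    """Evaluate h_b^{-1}(delta(G)) with delta(G) = 1 - C(G)/R, as in (39)."""
    if not 0.0 <= test_capacity < rate:
        raise ValueError("Lemma 7 needs 0 <= C(G) < R")
    delta = 1.0 - test_capacity / rate
    return inverse_binary_entropy(delta)


# Hypothetical example: a rate-1/2 code over a BSC test channel with crossover 0.2.
example_capacity = 1.0 - binary_entropy(0.2)
example_bound = lemma7_bound(0.5, example_capacity)
```

Since $h_b(x) \ge 2x$ on $[0, \frac{1}{2}]$, the value returned never exceeds $\delta(G)/2$, which gives a quick sanity check on the numerics.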

Let $b_1^k$ denote the entire message, and let $x_1^m$ be the corresponding codeword. Let the common randomness 
available to the encoder and decoder be denoted by the random variable $U$, and its realizations by $u$. 

Consider the $i$-th message bit $b_i$. Its decoding is performed by observing a particular decoding neighborhood$^{10}$ 
of channel outputs $y^n_{\mathrm{nbd},i}$. The corresponding channel inputs are denoted by $x^n_{\mathrm{nbd},i}$ and the relevant channel noise by 
$z^n_{\mathrm{nbd},i} = x^n_{\mathrm{nbd},i} \oplus y^n_{\mathrm{nbd},i}$ where $\oplus$ is used to denote modulo 2 addition. The decoder just checks whether the observed 
$y^n_{\mathrm{nbd},i} \in \mathcal{D}_{y,i}(0,u)$ to decode to $\hat{B}_i = 0$ or whether $y^n_{\mathrm{nbd},i} \in \mathcal{D}_{y,i}(1,u)$ to decode to $\hat{B}_i = 1$. 

For given $x^n_{\mathrm{nbd},i}$, the error event is equivalent to $z^n_{\mathrm{nbd},i}$ falling in a decoding region $\mathcal{D}_{z,i}(x^n_{\mathrm{nbd},i}, b_1^k, u) = \mathcal{D}_{y,i}(1 \oplus b_i, u) \oplus x^n_{\mathrm{nbd},i}$. 
Thus by the linearity of expectations, (39) can be rewritten as: 

$$\frac{1}{k}\sum_{i}\frac{1}{2^k}\sum_{b_1^k}\sum_{u}\Pr(U = u)\Pr_G\left(Z^n_{\mathrm{nbd},i} \in \mathcal{D}_{z,i}(x^n_{\mathrm{nbd},i}(b_1^k,u), b_1^k, u)\right) \ge h_b^{-1}(\delta(G)). \tag{40}$$

The following lemma gives a lower bound to the probability of an event under channel P given a lower bound 
to its probability under channel G. 

Lemma 8: Let $A$ be a set of BSC channel-noise realizations $z_1^n$ such that $\Pr_G(A) = \delta$. Then 

$$\Pr_P(A) \ge f(\delta) \tag{41}$$

where 

$$f(x) = \frac{x}{2}\, 2^{-nD(g\|p)} \left(\frac{p(1-g)}{g(1-p)}\right)^{\epsilon(x)\sqrt{n}} \tag{42}$$

is a convex-∪ increasing function of $x$ and 

$$\epsilon(x) = \sqrt{\frac{1}{K(g)}\log_2\left(\frac{2}{x}\right)}. \tag{43}$$

Proof: See Appendix I-B. ■ 
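
To get a feel for the mapping in Lemma 8, the following sketch evaluates $f(x)$ from (42) and $\epsilon(x)$ from (43) numerically. It is only an illustration under stated assumptions: $K(g) = \inf_{0<\eta<1-g} D(g+\eta\|g)/\eta^2$ is approximated by a minimum over a finite grid (so the value returned slightly overestimates the true infimum), and the function names and the example parameters are hypothetical.

```python
import math


def kl_bernoulli(a, b):
    """KL divergence D(a||b) in bits between Bernoulli(a) and Bernoulli(b)."""
    total = 0.0
    if a > 0.0:
        total += a * math.log2(a / b)
    if a < 1.0:
        total += (1.0 - a) * math.log2((1.0 - a) / (1.0 - b))
    return total


def k_of_g(g, grid=20000):
    """Grid estimate of K(g) = inf over 0 < eta < 1 - g of D(g+eta||g)/eta^2."""
    best = math.inf
    for j in range(1, grid + 1):
        eta = (1.0 - g) * j / grid
        best = min(best, kl_bernoulli(min(g + eta, 1.0), g) / eta ** 2)
    return best


def epsilon(x, g):
    """epsilon(x) from (43), using the grid estimate of K(g)."""
    return math.sqrt(math.log2(2.0 / x) / k_of_g(g))


def f_bsc(x, n, g, p):
    """The mapping f(x) from (42) for neighborhood size n and p < g < 1/2."""
    ratio = p * (1.0 - g) / (g * (1.0 - p))
    scale = 2.0 ** (-n * kl_bernoulli(g, p))
    return 0.5 * x * scale * ratio ** (epsilon(x, g) * math.sqrt(n))
```

For instance, `f_bsc(x, 100, 0.2, 0.1)` evaluated at increasing values of `x` should itself increase, in line with the monotonicity claimed in the lemma, and can never exceed `x / 2`.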
Applying Lemma 8 in the style of (40) tells us that: 



10 For any given decoder implementation, the size of the decoding neighborhood might be different for different bits $i$. However, to avoid 
unnecessarily complex notation, we assume that the neighborhoods are all the same size $n$ corresponding to the largest possible neighborhood 
size. This can be assumed without loss of generality since smaller decoding neighborhoods can be supplemented with additional channel 
outputs that are ignored by the decoder. 
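$$\langle P_e \rangle_P = \frac{1}{k}\sum_{i}\frac{1}{2^k}\sum_{b_1^k}\sum_{u}\Pr(U = u)\Pr_P\left(Z^n_{\mathrm{nbd},i} \in \mathcal{D}_{z,i}(x^n_{\mathrm{nbd},i}(b_1^k,u), b_1^k, u)\right)$$

$$\ge \frac{1}{k}\sum_{i}\frac{1}{2^k}\sum_{b_1^k}\sum_{u}\Pr(U = u)\, f\left(\Pr_G\left(Z^n_{\mathrm{nbd},i} \in \mathcal{D}_{z,i}(x^n_{\mathrm{nbd},i}(b_1^k,u), b_1^k, u)\right)\right). \tag{44}$$

But the increasing function $f(\cdot)$ is also convex-∪ and thus (44) and (40) imply that 

$$\langle P_e \rangle_P \ge f\left(\frac{1}{k}\sum_{i}\frac{1}{2^k}\sum_{b_1^k}\sum_{u}\Pr(U = u)\Pr_G\left(Z^n_{\mathrm{nbd},i} \in \mathcal{D}_{z,i}(x^n_{\mathrm{nbd},i}(b_1^k,u), b_1^k, u)\right)\right)$$

$$\ge f\left(h_b^{-1}(\delta(G))\right).$$

This proves Theorem 1. ■ 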

At the cost of slightly more complicated notation, by following the techniques in [16], similar results can be 
proved for decoding across any discrete memoryless channel by using Hoeffding's inequality in place of the Chernoff 
bounds used here in the proof of Lemma 7. In place of the KL-divergence term $D(g\|p)$, for a general DMC the 
arguments give rise to a term $\max_x D(G_x\|P_x)$ that picks out the channel input letter that maximizes the divergence 
between the two channels' outputs. For output-symmetric channels, the combination of these terms and the outer 
maximization over channels $G$ with capacity less than $R$ will mean that the divergence term will behave like the 
standard sphere-packing bound when $n$ is large. When the channel is not output symmetric (in the sense of [13]), 
the resulting divergence term will behave like the Haroutunian bound for fixed block-length coding over DMCs 
with feedback [50]. 



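A. Proof of Lemma 7: a lower bound on $\langle P_e \rangle_G$ 
Proof: 

$$H(B_1^k) - H(B_1^k \mid Y_1^m) = I(B_1^k; Y_1^m) \le I(X_1^m; Y_1^m) \le mC(G).$$

Since the Ber($\frac{1}{2}$) message bits are iid, $H(B_1^k) = k$. Therefore, 

$$\frac{1}{k} H(B_1^k \mid Y_1^m) \ge 1 - \frac{C(G)}{R}. \tag{45}$$

Suppose the message bit sequence was decoded to be $\hat{B}_1^k$. Denote the error sequence by $\tilde{B}_1^k$. Then, 

$$B_1^k = \hat{B}_1^k \oplus \tilde{B}_1^k, \tag{46}$$

where the addition $\oplus$ is modulo 2. The only complication is the possible randomization of both the encoder and 
decoder. However, note that even with randomization, the true message $B_1^k$ is independent of $\hat{B}_1^k$ conditioned on 
$Y_1^m$. Thus, 

$$\begin{aligned}
H(\tilde{B}_1^k \mid Y_1^m) &= H(\hat{B}_1^k \oplus B_1^k \mid Y_1^m) \\
&= H(\hat{B}_1^k \oplus B_1^k \mid Y_1^m) + I(B_1^k; \hat{B}_1^k \mid Y_1^m) \\
&= H(\hat{B}_1^k \oplus B_1^k \mid Y_1^m) - H(B_1^k \mid Y_1^m, \hat{B}_1^k) + H(B_1^k \mid Y_1^m) \\
&= I(\hat{B}_1^k \oplus B_1^k; \hat{B}_1^k \mid Y_1^m) + H(B_1^k \mid Y_1^m) \\
&\ge H(B_1^k \mid Y_1^m) \\
&\ge k\left(1 - \frac{C(G)}{R}\right).
\end{aligned}$$

This implies 

$$\sum_{i=1}^{k} H(\tilde{B}_i \mid Y_1^m) \ge k\left(1 - \frac{C(G)}{R}\right). \tag{47}$$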
Since conditioning reduces entropy, $H(\tilde{B}_i) \ge H(\tilde{B}_i \mid Y_1^m)$. Therefore, 

$$\frac{1}{k}\sum_{i=1}^{k} H(\tilde{B}_i) \ge 1 - \frac{C(G)}{R}. \tag{48}$$

Since the $\tilde{B}_i$ are binary random variables, $H(\tilde{B}_i) = h_b(\langle P_{e,i} \rangle_G)$, where $h_b(\cdot)$ is the binary entropy function. Since 
$h_b(\cdot)$ is a concave-∩ function, $h_b^{-1}(\cdot)$ is convex-∪ when restricted to output values from $[0, \frac{1}{2}]$. Thus, (48) together 
with Jensen's inequality implies the desired result (39). ■ 



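B. Proof of Lemma 8: a lower bound on $\langle P_{e,i} \rangle_P$ as a function of $\langle P_{e,i} \rangle_G$ 
Proof: First, consider a strongly $G$-typical set of $z^n_{\mathrm{nbd},i}$, given by 

$$T_{\epsilon,G} = \left\{ z_1^n \ \text{s.t.}\ \sum_{i=1}^{n} z_i - ng \le \epsilon\sqrt{n} \right\}. \tag{49}$$

In words, $T_{\epsilon,G}$ is the set of noise sequences with weights smaller than $ng + \epsilon\sqrt{n}$. The probability of an event $A$ 
can be bounded using 

$$\begin{aligned}
\delta &= \Pr_G(Z_1^n \in A) \\
&= \Pr_G(Z_1^n \in A \cap T_{\epsilon,G}) + \Pr_G(Z_1^n \in A \cap T_{\epsilon,G}^c) \\
&\le \Pr_G(Z_1^n \in A \cap T_{\epsilon,G}) + \Pr_G(Z_1^n \in T_{\epsilon,G}^c).
\end{aligned}$$

Consequently, 

$$\Pr_G(Z_1^n \in A \cap T_{\epsilon,G}) \ge \delta - \Pr_G(T_{\epsilon,G}^c). \tag{50}$$

Lemma 9: The probability of the atypical set of Bernoulli-$g$ channel noise $\{Z_i\}$ is bounded above by 

$$\Pr\left(\frac{\sum_{i}^{n} Z_i - ng}{\sqrt{n}} \ge \epsilon\right) \le 2^{-K(g)\epsilon^2} \tag{51}$$

where $K(g) = \inf_{0 < \eta < 1-g} \frac{D(g+\eta\|g)}{\eta^2}$. 

Proof: See Appendix I-C. ■ 

Choose $\epsilon$ such that 

$$2^{-K(g)\epsilon^2} = \frac{\delta}{2}, \quad \text{i.e.,} \quad \epsilon^2 = \frac{1}{K(g)}\log_2\left(\frac{2}{\delta}\right). \tag{52}$$

Thus (50) becomes 

$$\Pr_G(Z_1^n \in A \cap T_{\epsilon,G}) \ge \frac{\delta}{2}. \tag{53}$$

Let $n_{z_1^n}$ denote the number of ones in $z_1^n$. Then, 

$$\Pr_G(Z_1^n = z_1^n) = g^{n_{z_1^n}}(1-g)^{n - n_{z_1^n}}. \tag{54}$$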
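This allows us to lower bound the probability of $A$ under channel law $P$ as follows: 

$$\begin{aligned}
\Pr_P(Z_1^n \in A) &\ge \Pr_P(Z_1^n \in A \cap T_{\epsilon,G}) \\
&= \sum_{z_1^n \in A \cap T_{\epsilon,G}} \frac{\Pr_P(z_1^n)}{\Pr_G(z_1^n)} \Pr_G(z_1^n) \\
&= \sum_{z_1^n \in A \cap T_{\epsilon,G}} \frac{(1-p)^n}{(1-g)^n} \left(\frac{p(1-g)}{g(1-p)}\right)^{n_{z_1^n}} \Pr_G(z_1^n) \\
&\ge \frac{(1-p)^n}{(1-g)^n} \left(\frac{p(1-g)}{g(1-p)}\right)^{ng + \epsilon\sqrt{n}} \Pr_G(Z_1^n \in A \cap T_{\epsilon,G}) \\
&\ge \frac{\delta}{2}\, \frac{(1-p)^n}{(1-g)^n} \left(\frac{p(1-g)}{g(1-p)}\right)^{ng + \epsilon\sqrt{n}} \\
&= \frac{\delta}{2}\, 2^{-nD(g\|p)} \left(\frac{p(1-g)}{g(1-p)}\right)^{\epsilon\sqrt{n}}.
\end{aligned}$$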



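This results in the desired expression: 

$$f(x) = \frac{x}{2}\, 2^{-nD(g\|p)} \left(\frac{p(1-g)}{g(1-p)}\right)^{\epsilon(x)\sqrt{n}} \tag{55}$$

where $\epsilon(x) = \sqrt{\frac{1}{K(g)}\log_2\left(\frac{2}{x}\right)}$. To see the convexity of $f(x)$, it is useful to apply some substitutions. Let 
$c_1 = \frac{2^{-nD(g\|p)}}{2} > 0$ and let $\xi = \sqrt{\frac{n}{K(g)\ln 2}}\, \ln\left(\frac{p(1-g)}{g(1-p)}\right)$. Notice that $\xi < 0$ since the term inside the $\ln$ is less than 1. Then 
$f(x) = c_1 x \exp\left(\xi\sqrt{\ln 2 - \ln x}\right)$. 
Differentiating $f(x)$ once results in 

$$f'(x) = c_1 \exp\left(\xi\sqrt{\ln(2) + \ln(\tfrac{1}{x})}\right)\left(1 - \frac{\xi}{2\sqrt{\ln(2) + \ln(\tfrac{1}{x})}}\right). \tag{56}$$

By inspection, $f'(x) > 0$ for all $0 < x < 1$ and thus $f(x)$ is a monotonically increasing function. Differentiating 
$f(x)$ twice with respect to $x$ gives 

$$f''(x) = -\xi\, \frac{c_1 \exp\left(\xi\sqrt{\ln(2) + \ln(\tfrac{1}{x})}\right)}{2x\sqrt{\ln(2) + \ln(\tfrac{1}{x})}} \left(1 + \frac{1}{2\left(\ln(2) + \ln(\tfrac{1}{x})\right)} - \frac{\xi}{2\sqrt{\ln(2) + \ln(\tfrac{1}{x})}}\right). \tag{57}$$

Since $\xi < 0$, it is evident that all the terms in (57) are strictly positive. Therefore, $f(\cdot)$ is convex-∪. ■ 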

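C. Proof of Lemma 9: Bernoulli Chernoff bound 

Proof: Recall that $Z_i$ are iid Bernoulli random variables with mean $g \le 1/2$. 

$$\Pr\left(\frac{\sum_i (Z_i - g)}{\sqrt{n}} \ge \epsilon\right) = \Pr\left(\frac{\sum_i (Z_i - g)}{n} \ge \tilde{\epsilon}\right) \tag{58}$$

where $\epsilon = \sqrt{n}\,\tilde{\epsilon}$ and so $n = \epsilon^2/\tilde{\epsilon}^2$. Therefore, 

$$\Pr\left(\frac{\sum_i (Z_i - g)}{\sqrt{n}} \ge \epsilon\right) \le \left[\left(1 - g + g\exp(s)\right) \times \exp\left(-s(g + \tilde{\epsilon})\right)\right]^n \quad \text{for all } s \ge 0. \tag{59}$$

Choose $s$ satisfying 

$$\exp(-s) = \frac{g}{1-g} \times \left(\frac{1}{g + \tilde{\epsilon}} - 1\right). \tag{60}$$

It is safe to assume that $g + \tilde{\epsilon} \le 1$ since otherwise, the relevant probability is 0 and any bound will work. Substituting 
(60) into (59) gives 

$$\Pr\left(\frac{\sum_i (Z_i - g)}{\sqrt{n}} \ge \epsilon\right) \le 2^{-\frac{D(g + \tilde{\epsilon}\|g)}{\tilde{\epsilon}^2}\epsilon^2}.$$

This bound holds under the constraint $\epsilon^2/\tilde{\epsilon}^2 = n$. To obtain a bound that holds uniformly for all $n$, we fix $\epsilon$, and take 
the supremum over all the possible $\tilde{\epsilon}$ values. 

$$\begin{aligned}
\Pr\left(\frac{\sum_i (Z_i - g)}{\sqrt{n}} \ge \epsilon\right) &\le \sup_{0 < \tilde{\epsilon} \le 1-g} \exp\left(-\ln(2)\frac{D(g + \tilde{\epsilon}\|g)}{\tilde{\epsilon}^2}\epsilon^2\right) \\
&\le \exp\left(-\ln(2)\,\epsilon^2 \inf_{0 < \tilde{\epsilon} \le 1-g} \frac{D(g + \tilde{\epsilon}\|g)}{\tilde{\epsilon}^2}\right),
\end{aligned}$$

giving us the desired bound. ■ 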
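
The Chernoff step above can be spot-checked against exact binomial tails. The following sketch is an illustration rather than part of the proof: it compares the exact probability $\Pr(\sum_i Z_i \ge ng + \epsilon\sqrt{n})$ with the per-$\tilde{\epsilon}$ bound $2^{-nD(g+\tilde{\epsilon}\|g)}$ obtained by substituting (60) into (59), before the infimum defining $K(g)$ is taken. All function names and test parameters are hypothetical.

```python
import math


def kl_bernoulli(a, b):
    """KL divergence D(a||b) in bits between Bernoulli(a) and Bernoulli(b)."""
    total = 0.0
    if a > 0.0:
        total += a * math.log2(a / b)
    if a < 1.0:
        total += (1.0 - a) * math.log2((1.0 - a) / (1.0 - b))
    return total


def binomial_upper_tail(n, g, threshold):
    """Exact Pr(sum of n iid Bernoulli(g) variables >= threshold)."""
    k_min = max(0, math.ceil(threshold))
    return sum(
        math.comb(n, k) * g ** k * (1.0 - g) ** (n - k)
        for k in range(k_min, n + 1)
    )


def chernoff_tail_bound(n, g, eps):
    """Bound 2^{-n D(g + eps/sqrt(n) || g)} on Pr(sum_i (Z_i - g)/sqrt(n) >= eps)."""
    a = g + eps / math.sqrt(n)
    if a > 1.0:
        return 0.0  # the event is impossible, so any bound works
    return 2.0 ** (-n * kl_bernoulli(a, g))
```

Because the Chernoff bound ignores polynomial factors, the exact tail is typically noticeably smaller than the bound; the comparison is only meant to confirm the direction of the inequality.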



Appendix II 

Proof of Theorem 2: lower bound on $\langle P_e \rangle_P$ for AWGN channels 

The AWGN case can be proved using an argument almost identical to the BSC case. Once again, the focus is on 
the channel noise $Z$ in the decoding neighborhoods [51]. Notice that Lemma 7 already applies to this channel even 
if the power constraint only has to hold on average over all codebooks and messages. Thus, all that is required is 
a counterpart to Lemma 8 giving a convex-∪ mapping from the probability of a set of channel-noise realizations 
under a Gaussian channel with noise variance $\sigma_G^2$ back to their probability under the original channel with noise 
variance $\sigma_P^2$. 



Lemma 10: Let $A$ be a set of Gaussian channel-noise realizations $z_1^n$ such that $\Pr_G(A) = \delta$. Then 

$$\Pr_P(A) \ge f(\delta) \tag{61}$$

where 

$$f(\delta) = \frac{\delta}{2}\exp\left(-nD(\sigma_G^2\|\sigma_P^2) - \sqrt{n}\left(\frac{3}{2} + 2\ln\left(\frac{2}{\delta}\right)\right)\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right)\right). \tag{62}$$

Furthermore, $f(\delta)$ is a convex-∪ increasing function in $\delta$ for all values of $\sigma_G^2 \ge \sigma_P^2$. 

In addition, the following bound is also convex whenever $\sigma_G^2 > \sigma_P^2\,\mu(n)$ with $\mu(n)$ as defined in (13): 

$$f_L(\delta) = \frac{\delta}{2}\exp\left(-nD(\sigma_G^2\|\sigma_P^2) - \frac{1}{2}\phi(n,\delta)\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right)\right) \tag{63}$$

where $\phi(n, \delta)$ is as defined in (16). 

Proof: See Appendix II-A. ■ 

With Lemma 10 playing the role of Lemma 8, the proof for Theorem 2 proceeds identically to that of Theorem 1. 
It should be clear that similar arguments can be used to prove similar results for any additive-noise models for 
continuous output communication channels. However, we do not believe that this will result in the best possible 
bounds. Instead, even the bounds for the AWGN case seem suboptimal because we are ignoring the possibility of 
a large deviation in the noise that happens to be locally aligned to the codeword itself. 



A. Proof of Lemma 10: a lower bound on $\langle P_{e,i} \rangle_P$ as a function of $\langle P_{e,i} \rangle_G$ 
Proof: Consider the length-$n$ set of $G$-typical additive noise given by 

$$T_{\epsilon,G} = \left\{ z_1^n : \frac{\|z_1^n\|^2 - n\sigma_G^2}{n} \le \epsilon \right\}. \tag{64}$$

With this definition, (50) continues to hold in the Gaussian case. 

There are two different Gaussian counterparts to Lemma 9. They are both expressed in the following lemma. 
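Lemma 11: For Gaussian noise $Z_i$ with variance $\sigma_G^2$, 

$$\Pr\left(\frac{1}{n}\sum_{i=1}^{n}\frac{Z_i^2}{\sigma_G^2} > 1 + \frac{\epsilon}{\sigma_G^2}\right) \le \left(\left(1 + \frac{\epsilon}{\sigma_G^2}\right)\exp\left(-\frac{\epsilon}{\sigma_G^2}\right)\right)^{\frac{n}{2}}. \tag{65}$$

Furthermore, 

$$\Pr\left(\frac{1}{n}\sum_{i=1}^{n}\frac{Z_i^2}{\sigma_G^2} > 1 + \frac{\epsilon}{\sigma_G^2}\right) \le \exp\left(-\frac{\sqrt{n}\,\epsilon}{4\sigma_G^2}\right) \tag{66}$$

for all $\epsilon \ge \frac{3\sigma_G^2}{\sqrt{n}}$. 

Proof: See Appendix II-B. ■ 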

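To have $\Pr_G(T_{\epsilon,G}^c) \le \frac{\delta}{2}$, it suffices to pick any $\epsilon(\delta, n)$ large enough. 
So 

$$\Pr_P(A) \ge \int_{z_1^n \in A \cap T_{\epsilon,G}} f_P(z_1^n)\, dz_1^n = \int_{z_1^n \in A \cap T_{\epsilon,G}} \frac{f_P(z_1^n)}{f_G(z_1^n)}\, f_G(z_1^n)\, dz_1^n. \tag{67}$$

Consider the ratio of the two pdfs for $z_1^n \in T_{\epsilon,G}$: 

$$\begin{aligned}
\frac{f_P(z_1^n)}{f_G(z_1^n)} &= \left(\frac{\sigma_G^2}{\sigma_P^2}\right)^{\frac{n}{2}} \exp\left(-\|z_1^n\|^2\left(\frac{1}{2\sigma_P^2} - \frac{1}{2\sigma_G^2}\right)\right) \\
&\ge \exp\left(-\left(n\sigma_G^2 + n\epsilon(\delta,n)\right)\left(\frac{1}{2\sigma_P^2} - \frac{1}{2\sigma_G^2}\right) + n\ln\left(\frac{\sigma_G}{\sigma_P}\right)\right) \\
&= \exp\left(-\frac{n\epsilon(\delta,n)}{2\sigma_G^2}\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right) - nD(\sigma_G^2\|\sigma_P^2)\right)
\end{aligned} \tag{68}$$

where $D(\sigma_G^2\|\sigma_P^2)$ is the KL-divergence between two Gaussian distributions of variances $\sigma_G^2$ and $\sigma_P^2$ respectively. 
Substitute (68) back in (67) to get 

$$\begin{aligned}
\Pr_P(A) &\ge \exp\left(-\frac{n\epsilon(\delta,n)}{2\sigma_G^2}\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right) - nD(\sigma_G^2\|\sigma_P^2)\right) \int_{z_1^n \in A \cap T_{\epsilon,G}} f_G(z_1^n)\, dz_1^n \\
&\ge \frac{\delta}{2}\exp\left(-nD(\sigma_G^2\|\sigma_P^2) - \frac{n\epsilon(\delta,n)}{2\sigma_G^2}\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right)\right).
\end{aligned} \tag{69}$$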

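At this point, it is necessary to make a choice of $\epsilon(\delta, n)$. If we are interested in studying the asymptotics as $n$ 
gets large, we can use (66). This reveals that it is sufficient to choose $\epsilon \ge \sigma_G^2 \max\left(\frac{3}{\sqrt{n}}, \frac{4\ln(2/\delta)}{\sqrt{n}}\right)$. A safe bet 
is $\epsilon = \sigma_G^2\,\frac{3 + 4\ln(2/\delta)}{\sqrt{n}}$, or $n\epsilon(\delta,n) = \sqrt{n}\left(3 + 4\ln(\frac{2}{\delta})\right)\sigma_G^2$. Thus (53) holds as well with this choice of $\epsilon(\delta,n)$. 
Substituting into (69) gives 

$$\Pr_P(A) \ge \frac{\delta}{2}\exp\left(-nD(\sigma_G^2\|\sigma_P^2) - \sqrt{n}\left(\frac{3}{2} + 2\ln\left(\frac{2}{\delta}\right)\right)\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right)\right).$$

This establishes the desired $f(\cdot)$ function from (62). To see that this function $f(x)$ is convex-∪ and increasing 
in $x$, define $c_1 = \exp\left(-nD(\sigma_G^2\|\sigma_P^2) - \sqrt{n}\left(\frac{3}{2} + 2\ln(2)\right)\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right) - \ln(2)\right)$ and $\xi = 2\sqrt{n}\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right) > 0$. Then 
$f(\delta) = c_1\delta\exp(\xi\ln(\delta)) = c_1\delta^{1+\xi}$, which is clearly monotonically increasing and convex-∪ by inspection. 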

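Attempting to use (65) is a little more involved. Let $\tilde{\epsilon} = \frac{\epsilon}{\sigma_G^2}$ for notational convenience. Then we must solve 
$(1 + \tilde{\epsilon})\exp(-\tilde{\epsilon}) = (\frac{\delta}{2})^{\frac{2}{n}}$. Substitute $u = 1 + \tilde{\epsilon}$ to get $u\exp(-u + 1) = (\frac{\delta}{2})^{\frac{2}{n}}$. This immediately simplifies to 
$-u\exp(-u) = -\exp(-1)(\frac{\delta}{2})^{\frac{2}{n}}$. At this point, we can immediately verify that $(\frac{\delta}{2})^{\frac{2}{n}} \in [0,1]$ and hence by the 
definition of the Lambert W function in [34], we get $u = -W_L\left(-\exp(-1)(\frac{\delta}{2})^{\frac{2}{n}}\right)$. Thus 

$$\tilde{\epsilon}(\delta, n) = -W_L\left(-\exp(-1)\left(\frac{\delta}{2}\right)^{\frac{2}{n}}\right) - 1. \tag{70}$$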
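
For readers who want numbers, the solution in (70) can be computed without a Lambert W routine. The sketch below is an assumption-laden illustration: instead of evaluating $W_L$ directly, it solves $(1+\tilde{\epsilon})\exp(-\tilde{\epsilon}) = (\delta/2)^{2/n}$ for $\tilde{\epsilon} \ge 0$ by bisection on the logarithm of both sides, and compares the result with the simpler sufficient choice $(3 + 4\ln(2/\delta))/\sqrt{n}$ from the text. The function names are hypothetical.

```python
import math


def eps_tilde_exact(delta, n, tol=1e-12):
    """Solve (1 + e) * exp(-e) = (delta/2)^(2/n) for e >= 0 by bisection."""
    if not 0.0 < delta <= 2.0:
        raise ValueError("delta must lie in (0, 2]")
    target = (2.0 / n) * math.log(delta / 2.0)  # log of the right-hand side, <= 0

    def log_lhs(e):
        # log((1 + e) exp(-e)); equals 0 at e = 0 and strictly decreases for e > 0
        return math.log1p(e) - e

    hi = 1.0
    while log_lhs(hi) > target:
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if log_lhs(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def eps_tilde_simple(delta, n):
    """The simpler sufficient choice (3 + 4 ln(2/delta)) / sqrt(n)."""
    return (3.0 + 4.0 * math.log(2.0 / delta)) / math.sqrt(n)
```

Since (66) is a weakening of (65) on its range of validity, the exact solution should never exceed the simpler choice; comparing the two for a few values of $\delta$ and $n$ shows how much is lost by using the simpler expression.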
Substituting this into (69) immediately gives the desired expression (63). All that remains is to verify the convexity. 
Let $v = \frac{1}{2}\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right)$. As above, $f_L(\delta) = \delta c_2\exp(-nv\tilde{\epsilon}(\delta, n))$. The derivatives can be taken using very tedious 
manipulations involving the relationship $W_L'(x) = \frac{W_L(x)}{x(1 + W_L(x))}$ from [34] and can be verified using computer-aided 
symbolic calculation. In our case $-\tilde{\epsilon}(\delta,n) = (W_L(x) + 1)$ and so this allows the expressions to be simplified. 

$$f_L'(\delta) = c_2\exp(-nv\tilde{\epsilon})\left(2v + 1 + \frac{2v}{\tilde{\epsilon}}\right). \tag{71}$$

Notice that all the terms above are positive and so the first derivative is always positive and the function is increasing 
in $\delta$. Taking another derivative gives 

$$f_L''(\delta) = c_2\,\frac{2v(1 + \tilde{\epsilon})\exp(-nv\tilde{\epsilon})}{\delta\tilde{\epsilon}}\left[1 + 4v + \frac{4v}{\tilde{\epsilon}} - \frac{4}{n\tilde{\epsilon}} - \frac{2}{n\tilde{\epsilon}^2}\right]. \tag{72}$$

Recall from (70) and the properties of the Lambert $W_L$ function that $\tilde{\epsilon}$ is a monotonically decreasing function of 
$\delta$ that is $+\infty$ when $\delta = 0$ and goes down to 0 at $\delta = 2$. Look at the term in brackets above and multiply it by the 
positive $n\tilde{\epsilon}^2$. This gives the quadratic expression 

$$(4v + 1)n\tilde{\epsilon}^2 + 4(vn - 1)\tilde{\epsilon} - 2. \tag{73}$$

This (73) is clearly convex-∪ in $\tilde{\epsilon}$ and negative at $\tilde{\epsilon} = 0$. Thus it must have a single zero-crossing for positive $\tilde{\epsilon}$ 
and be strictly increasing there. This also means that the quadratic expression is implicitly a strictly decreasing 
function of $\delta$. It thus suffices to just check the quadratic expression at $\delta = 1$ and make sure that it is non-negative. 
Evaluating (70) at $\delta = 1$ gives $\tilde{\epsilon}(1, n) = T(n)$ where $T(n)$ is defined in (14). 

It is also clear that (73) is a strictly increasing linear function of $v$ and so we can find the minimum value for 
$v$ above which (73) is guaranteed to be non-negative. This will guarantee that the function $f_L$ is convex-∪. The 
condition turns out to be $v \ge \frac{2 + 4T - nT^2}{4nT(T+1)}$ and hence $\sigma_G^2 = \sigma_P^2(2v + 1) \ge \frac{\sigma_P^2}{2}\left(1 + \frac{1}{T+1} + \frac{4T+2}{nT(T+1)}\right)$. This matches up 
with (13) and hence the Lemma is proved. ■ 



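B. Proof of Lemma 11: Chernoff bound for Gaussian noise 

Proof: The sum $\sum_{i=1}^{n}\frac{Z_i^2}{\sigma_G^2}$ is a standard $\chi^2$ random variable with $n$ degrees of freedom. 

$$\begin{aligned}
\Pr\left(\frac{1}{n}\sum_{i=1}^{n}\frac{Z_i^2}{\sigma_G^2} > 1 + \frac{\epsilon}{\sigma_G^2}\right) &\le_{(a)} \inf_{s > 0}\left(\frac{\exp\left(-s\left(1 + \frac{\epsilon}{\sigma_G^2}\right)\right)}{\sqrt{1 - 2s}}\right)^n & (74) \\
&\le_{(b)} \left(\left(1 + \frac{\epsilon}{\sigma_G^2}\right)\exp\left(-\frac{\epsilon}{\sigma_G^2}\right)\right)^{\frac{n}{2}} & (75)
\end{aligned}$$

where (a) follows using standard moment generating functions for $\chi^2$ random variables and Chernoff bounding 
arguments and (b) results from the substitution $s = \frac{\epsilon}{2(\sigma_G^2 + \epsilon)}$. This establishes (65). 

For tractability, the goal is to replace (74) with an exponential of an affine function of $\frac{\epsilon}{\sigma_G^2}$. For notational 
convenience, let $\tilde{\epsilon} = \frac{\epsilon}{\sigma_G^2}$. The idea is to bound the polynomial term $\sqrt{1 + \tilde{\epsilon}}$ with an exponential as long as $\tilde{\epsilon} \ge \epsilon^*$. 

Let $\epsilon^* = \frac{3}{\sqrt{n}}$ and let $K = \frac{1}{2} - \frac{1}{4\sqrt{n}}$. Then it is clear that 

$$\sqrt{1 + \tilde{\epsilon}} \le \exp(K\tilde{\epsilon}) \tag{76}$$

as long as $\tilde{\epsilon} \ge \epsilon^*$. First, notice that the two agree at $\tilde{\epsilon} = 0$ and that the slope of the concave-∩ function $\sqrt{1 + \tilde{\epsilon}}$ 
there is $\frac{1}{2}$. Meanwhile, the slope of the convex-∪ function $\exp(K\tilde{\epsilon})$ at 0 is $K < \frac{1}{2}$. This means that $\exp(K\tilde{\epsilon})$ starts 
out below $\sqrt{1 + \tilde{\epsilon}}$. However, it has crossed to the other side by $\tilde{\epsilon} = \epsilon^*$. This can be verified by taking the logs 
of both sides of (76) and multiplying them both by 2. Consider the LHS evaluated at $\epsilon^*$ and upper-bound it by a 
third-order power-series expansion 

$$\ln\left(1 + \frac{3}{\sqrt{n}}\right) \le \frac{3}{\sqrt{n}} - \frac{9}{2n} + \frac{9}{n^{3/2}};$$

meanwhile the RHS of (76) can be dealt with exactly: 

$$2K\epsilon^* = \left(1 - \frac{1}{2\sqrt{n}}\right)\frac{3}{\sqrt{n}} = \frac{3}{\sqrt{n}} - \frac{3}{2n}.$$

For $n \ge 9$, the above immediately establishes (76) since $\frac{9}{2n} - \frac{3}{2n} = \frac{3}{n} \ge \frac{9}{n^{3/2}}$. The cases $n = 1, 2, 3, 4, 5, 6, 7, 8$ 
can be verified by direct computation. Using (76), for $\tilde{\epsilon} \ge \epsilon^*$ we have: 

$$\Pr(T_{\epsilon,G}^c) \le \left[\exp(K\tilde{\epsilon})\exp\left(-\frac{1}{2}\tilde{\epsilon}\right)\right]^n = \exp\left(-\frac{\sqrt{n}}{4}\tilde{\epsilon}\right). \tag{77}$$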
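
The small cases $n = 1, \ldots, 8$ of (76) that are left to direct computation can be checked with a few lines of code. The sketch below is only an illustration with hypothetical function names: it evaluates the slack $\exp(K\epsilon^*) - \sqrt{1 + \epsilon^*}$ at the crossover point and spot-checks the inequality on a finite grid beyond it.

```python
import math


def crossover_slack(n):
    """exp(K * eps_star) - sqrt(1 + eps_star) for eps_star = 3/sqrt(n), K = 1/2 - 1/(4 sqrt(n))."""
    eps_star = 3.0 / math.sqrt(n)
    k = 0.5 - 1.0 / (4.0 * math.sqrt(n))
    return math.exp(k * eps_star) - math.sqrt(1.0 + eps_star)


def holds_beyond_crossover(n, steps=2000, span=50.0):
    """Spot-check sqrt(1 + e) <= exp(K e) on a grid of e in [eps_star, eps_star + span]."""
    eps_star = 3.0 / math.sqrt(n)
    k = 0.5 - 1.0 / (4.0 * math.sqrt(n))
    for j in range(steps + 1):
        e = eps_star + span * j / steps
        if math.sqrt(1.0 + e) > math.exp(k * e):
            return False
    return True


# Slack at the crossover point for the cases handled by direct computation.
small_case_slacks = {n: crossover_slack(n) for n in range(1, 9)}
```

A non-negative slack at $\epsilon^*$ is enough, since a convex-∪ function that has overtaken a concave-∩ one stays above it; the grid check is a redundant safeguard rather than a proof.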



Appendix III 
Approximation analysis for the BSC 

A. Lemma 2 

Proof: (31) and (34) are obvious from the concave-∩ nature of the binary entropy function and its values at 
0 and $\frac{1}{2}$. 

$$\begin{aligned}
h_b(x) &= x\log_2(1/x) + (1 - x)\log_2(1/(1 - x)) \\
&\le_{(a)} 2x\log_2(1/x) = 2x\ln(1/x)/\ln(2) \\
&\le_{(b)} 2xd\left(x^{-\frac{1}{d}} - 1\right)/\ln(2) \quad \forall d > 1 \\
&\le 2x^{1 - 1/d}\, d/\ln(2).
\end{aligned}$$

Inequality (a) follows from the fact that $x^x < (1 - x)^{1-x}$ for $x \in (0, \frac{1}{2})$. For inequality (b), observe that $\ln(x) \le 
x - 1$. This implies $\ln(x^{1/d}) \le x^{1/d} - 1$. Therefore, $\ln(x) \le d(x^{1/d} - 1)$ for all $x > 0$ since $\frac{1}{d} < 1$ for $d > 1$. 
The bound on $h_b^{-1}(x)$ follows immediately by identical arguments. ■ 
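
The chain of inequalities in this proof is easy to sanity-check numerically. The sketch below, with hypothetical function names, evaluates $h_b(x)$, the intermediate bound $2x\log_2(1/x)$ from step (a), and the final bound $2x^{1-1/d}d/\ln 2$ so that the three can be compared for chosen $x \in (0, \frac{1}{2})$ and $d > 1$.

```python
import math


def binary_entropy(x):
    """Binary entropy h_b(x) in bits, with h_b(0) = h_b(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def step_a_bound(x):
    """The step-(a) bound 2 x log2(1/x), valid for x in (0, 1/2)."""
    return 2.0 * x * math.log2(1.0 / x)


def lemma2_upper_bound(x, d):
    """The final bound 2 d x^(1 - 1/d) / ln 2 for any d > 1."""
    if d <= 1.0:
        raise ValueError("d must exceed 1")
    return 2.0 * d * x ** (1.0 - 1.0 / d) / math.log(2.0)
```

Larger $d$ makes the exponent $1 - 1/d$ closer to 1 at the cost of a larger constant, which is the tradeoff exploited later when $d$ is taken arbitrarily large.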



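B. Lemma 3 
Proof: 

First, we investigate the small gap asymptotics for $\delta(g^*)$, where $g^* = p + gap^r$ and $r < 1$. 

$$\begin{aligned}
\delta(g^*) &= 1 - \frac{C(g^*)}{R} \\
&= 1 - \frac{C(p + gap^r)}{C(p) - gap} \\
&= 1 - \frac{C(p) - gap^r h_b'(p) + o(gap^r)}{C(p)(1 - gap/C(p))} \\
&= 1 - \left(1 - \frac{h_b'(p)}{C(p)}gap^r + o(gap^r)\right) \times \left(1 + gap/C(p) + o(gap)\right) \\
&= \frac{h_b'(p)}{C(p)}gap^r + o(gap^r).
\end{aligned} \tag{78}$$

Plugging (78) into (34) and using Lemma 2 gives 

$$\begin{aligned}
\log_2\left(h_b^{-1}(\delta(g^*))\right) &\le \log_2\left(\frac{h_b'(p)}{2C(p)}gap^r + o(gap^r)\right) & (79) \\
&= \log_2\left(\frac{h_b'(p)}{2C(p)}\right) + r\log_2(gap) + o(1) & (80) \\
&= r\log_2(gap) - 1 + \log_2\left(\frac{h_b'(p)}{C(p)}\right) + o(1) & (81)
\end{aligned}$$

and this establishes the upper half of (35). 
To see the lower half, we use (33): 

$$\begin{aligned}
\log_2\left(h_b^{-1}(\delta(g^*))\right) &\ge \frac{d}{d-1}\left(\log_2(\delta(g^*)) + \log_2\left(\frac{\ln 2}{2d}\right)\right) \\
&= \frac{d}{d-1}\left(\log_2\left(\frac{h_b'(p)}{C(p)}gap^r + o(gap^r)\right) + \log_2\left(\frac{\ln 2}{2d}\right)\right) \\
&= \frac{d}{d-1}\left(r\log_2(gap) + \log_2\left(\frac{h_b'(p)}{C(p)}\right) + o(1) + \log_2\left(\frac{\ln 2}{2d}\right)\right) \\
&= \frac{d}{d-1}\, r\log_2(gap) - 1 + K_1 + o(1)
\end{aligned}$$

where $K_1 = \frac{d}{d-1}\left(\log_2\left(\frac{h_b'(p)}{C(p)}\right) + \log_2\left(\frac{\ln 2}{d}\right)\right) - \frac{1}{d-1}$ and $d > 1$ is arbitrary. ■ 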

C. Lemma 4 
Proof: 

$$\begin{aligned}
D(g^*\|p) &= D(p + gap^r\|p) \\
&= 0 + 0 \times gap^r + \frac{1}{2}\,\frac{gap^{2r}}{p(1-p)\ln(2)} + o(gap^{2r})
\end{aligned}$$

since $D(p\|p) = 0$ and the first derivative is also zero. Simple calculus shows that the second derivative of $D(p + x\|p)$ 
with respect to $x$ is $\frac{\log_2(e)}{(p+x)(1-p-x)}$. ■ 
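
The quality of this second-order approximation can be examined numerically. The following sketch, with hypothetical function names and example values, compares $D(p + x\|p)$ in bits with $x^2/(2p(1-p)\ln 2)$; the ratio of the two should approach 1 as $x \to 0$.

```python
import math


def kl_bernoulli(a, b):
    """KL divergence D(a||b) in bits between Bernoulli(a) and Bernoulli(b)."""
    total = 0.0
    if a > 0.0:
        total += a * math.log2(a / b)
    if a < 1.0:
        total += (1.0 - a) * math.log2((1.0 - a) / (1.0 - b))
    return total


def quadratic_approx(x, p):
    """Second-order Taylor term x^2 / (2 p (1 - p) ln 2) of D(p + x || p)."""
    return x ** 2 / (2.0 * p * (1.0 - p) * math.log(2.0))


def approximation_ratio(x, p):
    """Ratio D(p + x || p) / quadratic_approx(x, p); tends to 1 as x -> 0."""
    return kl_bernoulli(p + x, p) / quadratic_approx(x, p)
```

For example, with the hypothetical choice $p = 0.1$ the ratio is within a few percent of 1 at $x = 10^{-2}$ and closer still for smaller $x$.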

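D. Lemma 5 
Proof: 

$$\begin{aligned}
\log_2\left(\frac{g^*(1-p)}{p(1-g^*)}\right) &= \log_2\left(\frac{1-p}{p}\right) + \log_2\left(\frac{g^*}{1-g^*}\right) \\
&= \log_2\left(\frac{1-p}{p}\right) + \log_2(g^*) - \log_2(1 - g^*) \\
&= \log_2\left(\frac{1-p}{p}\right) + \log_2(p + gap^r) - \log_2(1 - p - gap^r) \\
&= \log_2\left(\frac{1-p}{p}\right) + \log_2(p) + \log_2\left(1 + \frac{gap^r}{p}\right) - \log_2(1 - p) - \log_2\left(1 - \frac{gap^r}{1-p}\right) \\
&= \frac{gap^r}{p\ln(2)} + \frac{gap^r}{(1-p)\ln(2)} + o(gap^r) \\
&= \frac{gap^r}{p(1-p)\ln(2)} + o(gap^r) \\
&= \frac{gap^r}{p(1-p)\ln(2)}(1 + o(1)).
\end{aligned}$$
■ 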
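E. Lemma 6 

Proof: Expand $\epsilon$: 

$$\begin{aligned}
\epsilon &= \sqrt{\frac{1}{K(p + gap^r)}\log_2\left(\frac{2}{h_b^{-1}(\delta(G))}\right)} \\
&= \sqrt{\frac{1}{\ln(2)K(p + gap^r)}\left(\ln(2) - \ln\left(h_b^{-1}(\delta(G))\right)\right)} \\
&\ge \sqrt{\frac{1}{\ln(2)K(p + gap^r)}}\sqrt{\ln(2) - r\ln(gap) + \ln(2) - K_2\ln(2) + o(1)} \\
&= \sqrt{\frac{1}{\ln(2)K(p + gap^r)}}\sqrt{r\ln\left(\frac{1}{gap}\right) + (2 - K_2)\ln(2) + o(1)} \\
&= \sqrt{\frac{1}{\ln(2)K(p + gap^r)}}\sqrt{r\ln\left(\frac{1}{gap}\right)}\,(1 + o(1))
\end{aligned}$$

and similarly 

$$\begin{aligned}
\epsilon &= \sqrt{\frac{1}{\ln(2)K(p + gap^r)}\left(\ln(2) - \ln\left(h_b^{-1}(\delta(G))\right)\right)} \\
&\le \sqrt{\frac{1}{\ln(2)K(p + gap^r)}}\sqrt{(2 - K_2)\ln(2) + \frac{d}{d-1}\, r\ln\left(\frac{1}{gap}\right) + o(1)} \\
&= \sqrt{\frac{rd}{\ln(2)(d-1)K(p + gap^r)}\ln\left(\frac{1}{gap}\right)}\,(1 + o(1)).
\end{aligned}$$

All that remains is to show that $K(p + gap^r)$ converges to $K(p)$ as $gap \to 0$. Examine (10). The continuity of 
$\frac{D(g+\eta\|g)}{\eta^2}$ is clear in the interior $\eta \in (0, 1 - g)$ and for $g \in (0, \frac{1}{2})$. All that remains is to check the two boundaries. 
$\lim_{\eta \to 0}\frac{D(g+\eta\|g)}{\eta^2} = \frac{1}{2g(1-g)\ln 2}$ by the Taylor expansion of $D(g + \eta\|g)$ as done in the proof of Lemma 4. Similarly, 
$\lim_{\eta \to 1-g}\frac{D(g+\eta\|g)}{\eta^2} = \frac{D(1\|g)}{(1-g)^2} = \frac{\log_2(1/g)}{(1-g)^2}$. Since $K$ is a minimization of a continuous function over a compact 
set, it is itself continuous and thus the limit $\lim_{gap \to 0} K(p + gap^r) = K(p)$. 

Converting from natural logarithms to base 2 completes the proof. ■ 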



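F. Approximating the solution to the quadratic formula 
In (30), for $g = g^* = p + gap^r$, 

$$a = D(g^*\|p)$$

$$b = \epsilon\log_2\left(\frac{g^*(1-p)}{p(1-g^*)}\right)$$

$$c = \log_2(\langle P_e \rangle_P) - \log_2\left(h_b^{-1}(\delta(g^*))\right) + 1.$$

The first term, $a$, is approximated by Lemma 4 so 

$$a = gap^{2r}\left(\frac{1}{2p(1-p)\ln(2)} + o(1)\right). \tag{82}$$

Applying Lemma 5 and Lemma 6 reveals 

$$\begin{aligned}
b &\le \sqrt{\frac{rd}{(d-1)K(p)}\log_2\left(\frac{1}{gap}\right)}\;\frac{gap^r}{p(1-p)\ln(2)}\,(1 + o(1)) \\
&= \frac{1}{p(1-p)\ln(2)}\sqrt{\frac{rd}{(d-1)K(p)}\, gap^{2r}\log_2\left(\frac{1}{gap}\right)}\,(1 + o(1))
\end{aligned} \tag{83}$$

$$b \ge \frac{1}{p(1-p)\ln(2)}\sqrt{\frac{r}{K(p)}\, gap^{2r}\log_2\left(\frac{1}{gap}\right)}\,(1 + o(1)). \tag{84}$$

The third term, $c$, can be bounded similarly using Lemma 3 as follows, 

$$\begin{aligned}
c &= \beta\log_2(gap) - \log_2\left(h_b^{-1}(\delta(g^*))\right) + 1 \\
&\le \left(\frac{d}{d-1}\, r - \beta\right)\log_2\left(\frac{1}{gap}\right) + K_3 + o(1) & (85) \\
&\ge (r - \beta)\log_2\left(\frac{1}{gap}\right) + K_4 + o(1) & (86)
\end{aligned}$$

for a pair of constants $K_3, K_4$. Thus, for $gap$ small enough and $r < \frac{\beta(d-1)}{d}$, we know that $c < 0$. 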



The lower bound on $\sqrt{n}$ is thus 

$$\sqrt{n} \ge \frac{\sqrt{b^2 - 4ac} - b}{2a} = \frac{b}{2a}\left(\sqrt{1 - \frac{4ac}{b^2}} - 1\right). \tag{87}$$

Plugging in the bounds (1821 ) and (1841 ) reveals that 

b 



2a 

Similarly, using (f82l) . (l84b . (l85l ). we get 

4ac 



> 



log 2 f ^ 



V 9 a v ) \ r 



gap' 



K(p) 



6 2 



< 



( P (l- P )ln(2); 


x 




+ ^3 




( p(l-p) ln(2) ) K(p) 9 a P lo §2 ( gap j 


(l + o(l)) 



4p(l - p)if (p) ln(2) 

stant since r • 
Plugging dSU) and ([89]) into ([87]) gives: 



d p 



d — 1 r 



This tends to a negative constant since r < ^ d d l ' . 



n > 



" lQ g2 (555 



^feL 1 



(l + o(l)) A /l + 4p(l-p)ln(2)^(p) 



/? d 







(log 2 {I I gap)) 2 



W + Ap(l-p) \u(2)K(p) 



rd 
d-l 



r d-l 
-Vr] (l + o(l)) 



gap 



2r 



for all r < min{-f-, 1}. By taking d arbitrarily large, we arrive at Theorem @] for the BSC. 
Appendix IV 

Approximation analysis for the AWGN channel 
Taking logs on both sides of (11) for a fixed test channel $G$, 

$$\ln(\langle P_e \rangle_P) \ge \ln\left(h_b^{-1}(\delta(G))\right) - \ln(2) - nD(\sigma_G^2\|\sigma_P^2) - \sqrt{n}\left(\frac{3}{2} + 2\ln 2 - 2\ln\left(h_b^{-1}(\delta(G))\right)\right)\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right). \tag{91}$$

Rewriting this in the standard quadratic form using 

$$a = D(\sigma_G^2\|\sigma_P^2), \tag{92}$$

$$b = \left(\frac{3}{2} + 2\ln 2 - 2\ln\left(h_b^{-1}(\delta(G))\right)\right)\left(\frac{\sigma_G^2}{\sigma_P^2} - 1\right), \tag{93}$$

$$c = \ln(\langle P_e \rangle_P) - \ln\left(h_b^{-1}(\delta(G))\right) + \ln(2), \tag{94}$$

it suffices to show that the terms exhibit behavior as $gap \to 0$ similar to their BSC counterparts. 

For Taylor approximations, we use the channel $G^*$, with corresponding noise variance $\sigma_{G^*}^2 = \sigma_P^2 + \zeta$, where 

$$\zeta = gap^r\left(\frac{2\sigma_P^2(P_T + \sigma_P^2)}{P_T}\right). \tag{95}$$

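Lemma 12: For small enough $gap$, for $\zeta$ as in (95), if $r < 1$ then $C(G^*) < R$. 
Proof: Since $C(P) - gap = R > C(G^*)$, we must satisfy 

$$gap < \frac{1}{2}\log_2\left(1 + \frac{P_T}{\sigma_P^2}\right) - \frac{1}{2}\log_2\left(1 + \frac{P_T}{\sigma_P^2 + \zeta}\right). \tag{96}$$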



So the goal is to lower bound the RHS above to show that d95l ) is good enough to guarantee that this is bigger 
than the gap. So 



2V l0g2 l 1 + ^J" l0g2 l 1 + 4 + P r 
= i (log 2 (l + 2^(1 + - log 2 (l + 2gaf^ 

For small enough gap, this is a valid lower bound as long as c s < 1. Choose c s so that 1 < c s < p °f a , For £ as 
in d95l ), the LHS is gap r K and thus clearly having r < 1 suffices for satisfying ( 1961 ) for small enough gap. This 
is because the derivative of gap r tends to infinity as gap — > 0. ■ 



In the next Lemma, we perform the approximation analysis for the terms inside (92), (93) and (94). 
Lemma 13: Assume that $\sigma_{G^*}^2 = \sigma_P^2 + \zeta$ where $\zeta$ is defined in (95). 

(a) 

$$\frac{\sigma_{G^*}^2}{\sigma_P^2} - 1 = gap^r\left(\frac{2(P_T + \sigma_P^2)}{P_T}\right). \tag{97}$$

(b) 

$$\ln(\delta(G^*)) = r\ln(gap) + o(1) - \ln(C(P)). \tag{98}$$

(c) 

$$\ln\left(h_b^{-1}(\delta(G^*))\right) \ge \frac{d}{d-1}\, r\ln(gap) + c_2, \tag{99}$$

for some constant $c_2$ that is a function of $d$. 

$$\ln\left(h_b^{-1}(\delta(G^*))\right) \le r\ln(gap) + c_3, \tag{100}$$

for some constant $c_3$. 

(d) 

$$D(\sigma_{G^*}^2\|\sigma_P^2) = \frac{(P_T + \sigma_P^2)^2}{P_T^2}\, gap^{2r}(1 + o(1)). \tag{101}$$

Proof: (a) Immediately follows from the definitions and (95). 
(b) We start with simplifying $\delta(G^*)$: 

S(G*) = \-9v±l 
R 

C-gap- |log 2 (l + ^) 
C- aap 



C-gap 

C- aap 

C- aap 

i( CTM pf+^) + °(C)) -g^ 

C-gap 

I If r 2a p (P T + a p ) P T \ gap 

= ^2 [ 9ap — y T — 4(p+4) + 0{9ap } J " 5ap)(1 " — + o{gap)) 

= *f{i + *!)). 

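Taking $\ln(\cdot)$ on both sides, the result is evident. 

(c) follows from (b) and Lemma 2. 

(d) comes from the definition of $D(\sigma_{G^*}^2\|\sigma_P^2)$ followed immediately by the expansion $\ln(\sigma_{G^*}^2/\sigma_P^2) = \ln(1 + 
\zeta/\sigma_P^2) = \frac{\zeta}{\sigma_P^2} - \frac{1}{2}\left(\frac{\zeta}{\sigma_P^2}\right)^2 + o(gap^{2r})$. All the constant and first-order in $gap^r$ terms cancel since $\frac{\sigma_{G^*}^2}{\sigma_P^2} = 1 + \frac{\zeta}{\sigma_P^2}$. This 
gives the result immediately. ■ 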

Now, we can use Lemma [13] to approximate d92l ), d93l and d94j ). 



P 2 



<?ap 2r (l + o(l)) (102) 



T 

2(Pr + <r 2 p) 



^ + 21n2-21n(/ l - 1 (5(G)))) gap r 



Pt 



< 2d(P T + 4) r J_ 1 + 
(d - 1)Pt oap 

6 > 2 ( f! r + CT p) r ln (^ )gap r (l + o(l)) (104) 
P T gap 

c < (-*-. r -p)M—)(l + o(l)) (105) 
a — 1 gap 

c > (r-/3)ln(— )(l + o(l)). (106) 
Therefore, in parallel to (88), we have for the AWGN bound 



2a ~ (F T + a 2 P ) \ gap r 



) 



(l + o(l)). 



(107) 



Similarly, in parallel to (89), we have for the AWGN bound 



This is negative as long as r < 1 d '- , and so for every cs < ^ for small enough gap, we know that 




Combining this with (107) gives the bound: 




(108) 



(109) 



Since this holds for all < c s < \ and all r < min(l, ^ d d ^ ) for all d > 1, Theorem @] for AWGN channels 
follows. 